Running OpenClaw on a Raspberry Pi Kubernetes Cluster

How I turned five Raspberry Pis into a self hosted AI gateway with automated backups, full observability, and Telegram alerts.

OpenClaw (originally Clawdbot, then briefly Moltbot) is a viral, open-source autonomous AI assistant designed to execute complex tasks on your behalf across various digital platforms. Created by developer Peter Steinberger, it gives you a unified interface to multiple LLM providers. Think of it as a single front door to Claude, GPT, Gemini, DeepSeek, and whatever model you fancy, all behind one API, with autonomous execution on top. I wanted to self host it. More importantly, I wanted to learn Kubernetes the hard way. Not the “follow a tutorial on EKS” way. The “stack five Raspberry Pis on your desk and figure it out” way.

I could have bought a Mac mini or spun up a VM on AWS, installed OpenClaw, and been done in an afternoon. But where’s the fun in that?

This is the story of how I went from bare metal to an AI gateway running on a K3s cluster, complete with persistent storage, Prometheus monitoring, Grafana dashboards, Telegram alerting, automated backups, and a custom Docker image that cut gateway startup time from three minutes to under ten seconds.

The Hardware

Five Raspberry Pi 4 boards, each with 8GB RAM (7.6 GiB usable), running 64-bit Debian Trixie (arm64) with kernel 6.12.62+rpt-rpi-v8. They sit on my local network (I created a VLAN for OpenClaw), each with a 128GB SD card. One USB drive attached to the control plane node for backups. That’s the entire bill of materials.

Node   Role
pi-1   Control plane, NFS server, USB backup drive
pi-2   Worker: OpenClaw gateway + Redis
pi-3   Worker: general workloads
pi-4   Worker: general workloads
pi-5   Worker: Prometheus, Grafana, Alertmanager

Total cluster resources: 20 ARM64 cores, approximately 38 GiB usable RAM, around 500GB combined storage. More than enough for an AI gateway that proxies API calls rather than running inference locally. These little boards are not doing the thinking. They are directing traffic to models that do.

Step 1: Installing K3s

I went with K3s over full Kubernetes because it is purpose built for ARM and resource constrained environments. The entire control plane binary is under 100MB, and it ships with everything you need: containerd, CoreDNS, Traefik (which I replaced with nginx), and a local path provisioner for storage.

Control Plane (pi-1)

curl -sfL https://get.k3s.io | sh -s - server \
  --disable traefik \
  --write-kubeconfig-mode 644

I disabled Traefik because I prefer nginx ingress for its configurability. The kubeconfig lives at /etc/rancher/k3s/k3s.yaml.
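For completeness, here is roughly how I install the nginx ingress controller with Helm. The chart and repo are the standard upstream ones; treat this as a sketch, not a full values file:

```shell
# Sketch: install ingress-nginx on the K3s cluster via Helm
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace
```

The controller’s Service of type LoadBalancer picks up an external IP from MetalLB, which is set up in Step 3.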

Worker Nodes (pi-2 through pi-5)

First, grab the node token from the control plane:

cat /var/lib/rancher/k3s/server/node-token

Then on each worker:

curl -sfL https://get.k3s.io | K3S_URL=https://<CONTROL_PLANE_IP>:6443 \
  K3S_TOKEN=<NODE_TOKEN> sh -

Within a few minutes, kubectl get nodes shows all five nodes ready, all running K3s v1.34.4+k3s1 with containerd://2.1.5-k3s1. There is something deeply satisfying about watching five nodes come online one after another.

Step 2: Labelling Nodes for Workload Pinning

I did not want workloads drifting around the cluster. OpenClaw uses persistent volumes bound to specific nodes, and I wanted monitoring isolated from application traffic. Two labels sorted this out:

kubectl label node pi-2 openclaw-role=gateway
kubectl label node pi-5 openclaw-role=monitoring

Every deployment uses nodeSelector to pin to the right node. This keeps things predictable. When something breaks at 2am (it will), you always know exactly where to look.
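In each manifest this is a one line nodeSelector; for example, the gateway deployment’s pod template carries a fragment like this, matching the labels above:

```yaml
# Pod spec fragment: pin the gateway to the node labelled openclaw-role=gateway
spec:
  template:
    spec:
      nodeSelector:
        openclaw-role: gateway
```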

Step 3: Networking with MetalLB and Nginx Ingress

On a bare metal cluster, there is no cloud load balancer handing out external IPs. MetalLB fills that gap.

kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.9/config/manifests/metallb-native.yaml

I allocated a small pool from my home network:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.230-192.168.1.240
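One thing the pool alone does not do: since MetalLB 0.13, layer 2 mode also needs an explicit L2Advertisement resource, or the pool’s addresses are never actually announced on the network:

```yaml
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
```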

The OpenClaw gateway got one IP, nginx ingress got another. On my Mac, I pointed the gateway’s hostname at the ingress IP in /etc/hosts, so I can reach the web UI from my browser. Low tech, effective.

For TLS, cert-manager handles Let’s Encrypt certificates. The gateway serves plain HTTP, and nginx terminates TLS at the ingress layer.
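The issuer side is one small manifest. A sketch, assuming HTTP-01 solving through the nginx ingress (the email and secret name are placeholders):

```yaml
# ClusterIssuer for Let's Encrypt, solving HTTP-01 challenges through nginx ingress
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com          # placeholder: use your address
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx
```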

Step 4: Storage — Keep It Simple

I initially deployed Longhorn for distributed, replicated storage across the cluster. Impressive technology. On Raspberry Pis, it was overkill. The replication overhead consumed significant CPU and memory on nodes that had better things to do, and debugging storage issues on ARM64 was painful enough to make me question my life choices.

I ripped it all out and switched to K3s’s built in local-path provisioner. One PVC, one node, one directory on the SD card. No replication, no distributed consensus, no iSCSI.

storageClassName: local-path
accessModes:
  - ReadWriteOnce
resources:
  requests:
    storage: 5Gi

The tradeoff is obvious. If pi-2’s SD card dies, I lose the data on that node. But that is what backups are for, and I would rather have a cluster that runs smoothly 99.9% of the time than one that replicates storage at the cost of constant resource pressure.

OpenClaw uses three PVCs, all pinned to pi-2:

openclaw-config (5Gi): Mounted at /root/.openclaw. Holds the gateway config, credentials, agent profiles, and device pairing state.

openclaw-workspace (10Gi): Mounted at /root/openclaw/workspace. The agent’s working directory.

redis-data-redis-0 (2Gi): Redis persistence for session state.
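Written out in full, the config claim looks like this. The namespace matches the openclaw namespace used later; local-path binds the volume on whichever node first consumes it, which the gateway’s node pinning keeps on pi-2:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openclaw-config
  namespace: openclaw
spec:
  storageClassName: local-path
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```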

Hard lesson learned: never use emptyDir for the .openclaw directory. I did this during early testing. The pod restarted. All my workspace files, agent configuration, and pairing state vanished. Unrecoverable. PVCs are non negotiable for anything that matters.

Step 5: Building a Custom Gateway Image

Out of the box, OpenClaw runs on Node.js. You can npm install -g openclaw@beta and start the gateway. But doing that on every pod restart means a three minute startup time while npm downloads packages over the Pi’s network connection. On a cluster where pods restart for rolling updates, OOM kills, or node reboots, that is a startup tax you will pay forever.

So I built a custom Docker image:

FROM node:22-bookworm

# Install system packages in a single layer
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3-venv python3-pip jq ripgrep ffmpeg tmux rsync \
    buildah net-tools iputils-ping dnsutils vim-tiny htop \
    strace lsof procps iproute2 \
    && rm -rf /var/lib/apt/lists/*

# Install GitHub CLI
RUN curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg \
      -o /usr/share/keyrings/githubcli-archive-keyring.gpg \
    && echo "deb [arch=$(dpkg --print-architecture) signed-by=...] ..." \
      > /etc/apt/sources.list.d/github-cli.list \
    && apt-get update && apt-get install -y --no-install-recommends gh \
    && rm -rf /var/lib/apt/lists/*

# Install uv (fast Python package manager) and put it on PATH
# (the installer drops the binary in /root/.local/bin by default)
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:${PATH}"

# Install kubectl (matching K3s version)
RUN curl -LO "https://dl.k8s.io/release/v1.34.4/bin/linux/arm64/kubectl" \
    && chmod +x kubectl && mv kubectl /usr/local/bin/

# Install helm
RUN curl -LO "https://get.helm.sh/helm-v3.17.3-linux-arm64.tar.gz" \
    && tar -xzf helm-v3.17.3-linux-arm64.tar.gz \
    && mv linux-arm64/helm /usr/local/bin/ && rm -rf linux-arm64 *.tar.gz

# Pre-install OpenClaw (eliminates 3-min npm install on every restart).
# pipefail matters here: without it, a failed npm install would be masked
# by the pipe to tail and the build would "succeed" with a broken image.
RUN bash -o pipefail -c 'npm install -g openclaw@beta 2>&1 | tail -3'

EXPOSE 18789
ENTRYPOINT ["/bin/bash", "-c"]

A few key decisions worth explaining:

Base image: node:22-bookworm. OpenClaw needs Node.js, and Bookworm gives us a full Debian userland for the tools OpenClaw’s agents use.

Pre installed tools: ripgrep, jq, gh, ffmpeg, tmux, uv, kubectl, helm, and more. These unlock OpenClaw skills that require system tools, taking the gateway from 8 eligible skills to 13.

ARM64 cross build. Built on a Mac with docker buildx build --platform linux/arm64.

Chromium deliberately excluded. Saves approximately 500MB. Browser based skills can wait for a future version.

Hosted on GitHub Container Registry. Private repo, pulled with an image pull secret.
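The cross build itself is a one liner. The tag matches the deployment image below; pushing straight to GHCR assumes you are already authenticated with docker login ghcr.io:

```shell
# Build the ARM64 gateway image on a Mac and push it to GHCR
docker buildx build \
  --platform linux/arm64 \
  -t ghcr.io/<YOUR_ORG>/openclaw-gateway:v1.1.0 \
  --push .
```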

Startup time dropped from approximately three minutes to under ten seconds. That alone was worth the effort.

Step 6: The OpenClaw Deployment

Here is the heart of it. The Kubernetes deployment that runs the gateway. The entrypoint script is idempotent: it only creates config and credentials if they do not already exist on the PVC.

spec:
  containers:
    - name: openclaw
      image: ghcr.io/<YOUR_ORG>/openclaw-gateway:v1.1.0
      command: ["/bin/bash", "-c"]
      args:
        - |
          set -e
          echo "OpenClaw Gateway (custom image v1.1.0)"
          openclaw --version

          mkdir -p /root/.openclaw/credentials /root/.openclaw/devices

          if [ ! -f /root/.openclaw/openclaw.json ]; then
            echo "Creating initial config..."
            # ... create default config
          else
            echo "Config already exists on persistent volume, keeping it."
          fi

          echo "Starting OpenClaw Gateway..."
          exec openclaw gateway --port 18789 --bind lan
      env:
        - name: ANTHROPIC_API_KEY
          valueFrom:
            secretKeyRef:
              name: openclaw-secrets
              key: anthropic-api-key
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: provider-keys-v2
              key: openai-api-key
        - name: GEMINI_API_KEY
          valueFrom:
            secretKeyRef:
              name: provider-keys-v2
              key: gemini-api-key
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: 3000m
          memory: 3584Mi
      volumeMounts:
        - mountPath: /root/.openclaw
          name: openclaw-config
        - mountPath: /root/openclaw/workspace
          name: openclaw-workspace

A few things worth noting:

exec openclaw gateway. The exec replaces the shell process with the gateway, so the container gets proper signal handling. No zombie processes, clean shutdowns, correct health checks. Without exec, the gateway runs as a child of bash and Kubernetes cannot send signals to it properly.

API keys live in Kubernetes Secrets. Never hardcoded in config files. OpenClaw supports ${ENV_VAR} syntax in its config for referencing environment variables.

Resource limits are generous. The gateway gets up to 3 cores and 3.5GB of RAM on pi-2 (which has 4 cores and 7.6GB). AI agents can be memory hungry when processing long conversations.

RBAC gives the pod cluster admin. Yes, this is deliberately permissive. The Pi cluster IS the sandbox. My Mac is the security boundary. OpenClaw agents need to run kubectl, helm, and other cluster operations, and I would rather grant access explicitly than have agents fail silently.
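On the health check point: the manifest above omits probes for brevity. A minimal sketch that assumes nothing about OpenClaw’s HTTP routes is to probe the gateway’s TCP port inside the container spec:

```yaml
# Probe fragment: a TCP check on the gateway port avoids guessing at an HTTP health path
readinessProbe:
  tcpSocket:
    port: 18789
  initialDelaySeconds: 10
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 18789
  initialDelaySeconds: 30
  periodSeconds: 30
```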

A Gotcha: Gateway Bind vs Config Bind

This one took a while to figure out. The OpenClaw config says gateway.bind: "loopback", but the entrypoint starts the gateway with --bind lan. This looks like a bug, but it is intentional.

The --bind lan flag makes the gateway listen on 0.0.0.0, so Kubernetes service traffic can reach it. But the config’s loopback setting means CLI connections via 127.0.0.1 are treated as local. Loopback connections skip device pairing entirely. This is how cron jobs and CLI commands work inside the container without needing to pair a device first.

Step 7: Model Configuration and Cost Optimisation

OpenClaw supports multiple LLM providers. I configured a fallback chain that prioritises free and cheap models:

  1. Ollama Cloud / DeepSeek V3.1 (671B): Free tier, primary model
  2. Ollama Cloud / Qwen 3.5: Free tier, fallback
  3. Ollama Cloud / Devstral 2: Free tier, fallback
  4. Google / Gemini 2.0 Flash: Cheap, fast
  5. OpenAI / GPT-4.1 Mini: Last resort

The gateway also runs a heartbeat every 30 minutes during active hours (04:30 to 23:00 London time) using ollama-cloud/gemma3:12b, a free model that just checks the system is alive.
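The window logic is simple enough to sketch in shell. This is an illustration of the schedule, not OpenClaw’s actual implementation:

```shell
# Return success iff an "HH:MM" time falls inside the 04:30-23:00 active window
in_active_hours() {
  local hm=$1
  # minutes since midnight; 10# forces base 10 so "09" is not read as octal
  local mins=$(( 10#${hm%:*} * 60 + 10#${hm#*:} ))
  (( mins >= 4 * 60 + 30 && mins < 23 * 60 ))
}

in_active_hours "12:00" && echo "heartbeat" || echo "sleeping"
```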

For premium models like Claude Opus, I enabled prompt caching with cacheRetention: "long" (one hour TTL). It doubles the write cost but saves significantly on subsequent reads in multi turn conversations. If you are having extended back and forth sessions, the savings add up, fast.
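To see why, a back of envelope calculation. The 2x write multiplier is from the retention setting above; the 0.1x read multiplier is an assumption for illustration, and the sketch ignores the new tokens each turn adds:

```shell
# Rough cache economics over a 10-turn conversation with a large, stable context
turns=10
base=10                          # cost units to send the full context uncached
cache_write=$(( 2 * base ))      # long-retention cache write: 2x base
cache_read=1                     # cache read: 0.1x base (assumed multiplier)

no_cache=$(( turns * base ))
with_cache=$(( cache_write + (turns - 1) * cache_read ))

echo "no cache:   $no_cache units"
echo "with cache: $with_cache units"
```

Under those assumptions the cached conversation costs less than a third of the uncached one, and the gap widens with every extra turn.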

Step 8: Monitoring with Prometheus and Grafana

Observability was non negotiable. I deployed the kube-prometheus-stack Helm chart, heavily customised for Pi constraints:

KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  -f monitoring-values.yaml

The values file is tuned for minimal resource usage:

Prometheus: Two day retention, ephemeral storage (emptyDir, not PVC). If Prometheus restarts, I lose the last two days of metrics. For a home cluster, that is a perfectly acceptable tradeoff.

Grafana: No persistence, dashboards loaded via Helm values and sidecar ConfigMaps. If Grafana restarts, it rebuilds from config. Stateless by design.

Disabled false positive alerts: K3s bundles its control plane components differently from standard Kubernetes, so kubeControllerManager, kubeScheduler, kubeProxy, and kubeEtcd monitors are all disabled. Without this, you get a stream of noisy alerts about endpoints that simply do not exist.
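In monitoring-values.yaml that is four toggles, using the chart’s standard value keys:

```yaml
# K3s embeds these components, so their dedicated monitors only produce dead-endpoint alerts
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeProxy:
  enabled: false
kubeEtcd:
  enabled: false
```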

Everything is pinned to pi-5 via nodeSelector: { openclaw-role: monitoring }.

Custom Alerts

I defined five alerts that cover the things I actually care about:

- alert: NodeDown
  expr: up{job="node-exporter"} == 0
  for: 2m

- alert: PodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
  for: 5m

- alert: HighCPUUsage
  expr: >-
    100 - (avg by(instance)
    (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85

- alert: HighMemoryUsage
  expr: >-
    (1 - (node_memory_MemAvailable_bytes
    / node_memory_MemTotal_bytes)) * 100 > 85

- alert: HighDiskUsage
  expr: >-
    (1 - (node_filesystem_avail_bytes{mountpoint="/"}
    / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 80

Five alerts. Not fifty. Not five hundred. Five. Node down, pod crash looping, CPU hot, memory hot, disk filling up. If one of these fires, something genuinely needs attention. Everything else is noise.

Telegram Alerting

Alertmanager sends all critical and warning alerts to my Telegram via a bot. I get a nicely formatted HTML message with the alert name, namespace, severity, and description. The Watchdog and InfoInhibitor alerts are routed to a null receiver so they do not spam my phone.

receivers:
  - name: 'telegram'
    telegram_configs:
      - bot_token_file: '/etc/alertmanager/secrets/telegram-bot-token/bot-token'
        chat_id: <YOUR_CHAT_ID>
        parse_mode: 'HTML'
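The null receiver routing is a small addition to the same config (fragment):

```yaml
# Route Watchdog/InfoInhibitor to a null receiver so they never reach Telegram
route:
  receiver: 'telegram'
  routes:
    - matchers:
        - alertname =~ "Watchdog|InfoInhibitor"
      receiver: 'null'
receivers:
  - name: 'null'
```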

Getting alerts on the same app I use to chat with the AI gateway is oddly satisfying. Everything in one place.

Step 9: Automated Backups

A cluster without backups is a cluster waiting to teach you a painful lesson. I wrote a backup script that runs daily at 02:00 via a systemd timer on pi-1:

#!/bin/bash
# k3s-backup.sh — Daily backup for Pi K3s cluster
set -euo pipefail

BACKUP_ROOT="/mnt/usb-backup"
DATE=$(date +%Y-%m-%d_%H%M)
RETAIN_DAYS=7

# Create target directories up front so cp/tar never fail on a missing path
mkdir -p "$BACKUP_ROOT/k3s" "$BACKUP_ROOT/manifests" "$BACKUP_ROOT/openclaw/full-$DATE"

# 1. K3s state.db (best-effort copy of the live SQLite database)
sudo cp /var/lib/rancher/k3s/server/db/state.db \
  "$BACKUP_ROOT/k3s/state-$DATE.db"

# 2. K8s manifests (all namespaces)
kubectl get deployments,statefulsets,daemonsets,services,configmaps,\
secrets,ingresses,persistentvolumeclaims -A -o yaml \
  > "$BACKUP_ROOT/manifests/cluster-$DATE.yaml"

# 3. OpenClaw full data (tar from running pod)
POD=$(kubectl -n openclaw get pods -l app=openclaw-gateway \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n openclaw exec "$POD" -- tar czf - -C /root .openclaw \
  > "$BACKUP_ROOT/openclaw/full-$DATE/openclaw-data.tar.gz"

# 4. Prune backups older than 7 days
find "$BACKUP_ROOT" -type f -mtime +$RETAIN_DAYS -delete
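The prune in step 4 is easy to sanity check locally against a scratch directory before trusting it with real backups:

```shell
# Simulate the retention prune: files older than 7 days go, fresh ones stay
tmp=$(mktemp -d)
touch "$tmp/fresh.db"
touch -d "10 days ago" "$tmp/stale.db"   # GNU touch: backdate the mtime

find "$tmp" -type f -mtime +7 -delete

remaining=$(ls "$tmp")
echo "$remaining"                        # only fresh.db survives
rm -rf "$tmp"
```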

It backs up three things: the K3s state database (K3s’s SQLite stand in for etcd), a full YAML export of every resource in every namespace, and a tarball of the entire .openclaw directory from the running gateway pod.

Seven day retention keeps the USB drive from filling up. If I need to rebuild the cluster from scratch, I have everything I need.
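The systemd side is two small units on pi-1. Paths and unit names here are my choices; adjust to taste:

```ini
# /etc/systemd/system/k3s-backup.service
[Unit]
Description=Daily K3s cluster backup

[Service]
Type=oneshot
ExecStart=/usr/local/bin/k3s-backup.sh

# /etc/systemd/system/k3s-backup.timer
[Unit]
Description=Run k3s-backup daily at 02:00

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with systemctl daemon-reload and systemctl enable --now k3s-backup.timer. Persistent=true means a missed run (say, pi-1 was powered off at 02:00) fires as soon as the node is back.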

Step 10: Housekeeping CronJobs

Two Kubernetes CronJobs keep things tidy:

Scratch cleanup (daily at 03:00): The gateway pod has a scratch volume at /scratch for temporary files. A busybox container prunes anything older than seven days.

Memory consolidation reminder (Mondays at 09:00): Sends me a Telegram message reminding me to review OpenClaw’s memory files and consolidate the week’s learnings. It is a small thing, but it keeps the agent’s context from growing unbounded. Left unchecked, memory files bloat and the agent’s performance degrades.
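As a sketch, the scratch cleanup CronJob looks roughly like this. The volume backing path is an assumption; everything else is consistent with the setup above:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scratch-cleanup
  namespace: openclaw
spec:
  schedule: "0 3 * * *"             # daily at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            openclaw-role: gateway  # same node as the scratch volume
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: busybox:1.36
              command: ["sh", "-c", "find /scratch -type f -mtime +7 -delete"]
              volumeMounts:
                - name: scratch
                  mountPath: /scratch
          volumes:
            - name: scratch
              hostPath:
                path: /var/lib/openclaw-scratch   # assumption: actual backing path may differ
                type: DirectoryOrCreate
```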

The Result

After all this work, what does the cluster look like?

28 pods across 5 nodes (down from approximately 75 before I removed Longhorn and local Ollama inference). Gateway startup in around 8 seconds, down from over three minutes. An always on AI gateway accessible from Telegram, WhatsApp, and a web UI. Full observability with Grafana dashboards and Telegram alerts on my phone. Daily automated backups to USB with seven day retention. A cost optimised model chain that defaults to free models and escalates only when needed. And 13 unlocked OpenClaw skills including GitHub, session logs, video frames, and tmux.

For a stack of five Raspberry Pis sitting on my desk, that is not bad at all.

Lessons Learned

Longhorn is amazing, but not for Pis. Distributed storage on ARM single board computers with SD cards is a recipe for frustration. local-path plus backups is the right answer for a home cluster.

Never run openclaw doctor --fix at startup. It destructively strips config values. My entrypoint script learned this the hard way. Twice.

Pre bake your Docker image. Any npm install or apt-get that runs on every pod start is a startup tax you will pay forever. Bake it into the image.

exec your entrypoint. Without exec, the gateway runs as a child of bash. Kubernetes cannot send signals to it properly, health checks do not work, and you get zombie processes on shutdown.

Persistent volumes are sacred. The moment you think “emptyDir is fine for now,” you are one restart away from losing data that matters. If it matters, give it a PVC.

Pin workloads to nodes. On a small cluster, nodeSelector is your best friend. You always know where to look, and you avoid resource contention between unrelated workloads.

The Pi cluster IS the sandbox. I gave the gateway pod cluster-admin and a privileged security context. On a cloud cluster this would be reckless. On a home lab where my Mac is the security boundary and I have daily backups, it is pragmatic.

Telegram is the best ops channel. Full stop.

The whole setup lives in a single directory of YAML files, a Dockerfile, and a backup script. No Terraform, no Pulumi, no GitOps controller. Just kubectl apply and helm install. For a home lab running an AI gateway, that is exactly the right level of complexity.

If you are thinking about self hosting OpenClaw, or any AI gateway, on Raspberry Pis, I hope this gives you a head start. The Pis are more than capable. The real work is in the decisions: what to simplify, what to automate, and what to just delete.

If you have Raspberry Pis to hand, give it a go! Enjoy!

References

  1. K3s: Lightweight Kubernetes

  2. OpenClaw: AI Gateway

  3. MetalLB: Bare Metal Load Balancer for Kubernetes

  4. cert-manager: x509 Certificate Management for Kubernetes

  5. Longhorn: Cloud Native Distributed Storage for Kubernetes

  6. kube-prometheus-stack Helm Chart

  7. Kubernetes Local Path Provisioner

  8. Docker Buildx: Multi Platform Builds

  9. Kubernetes Persistent Volumes Documentation

  10. Alertmanager Telegram Integration