clouder kubeadm
Provision, set up, and manage kubeadm Kubernetes clusters with CRIU support. Handles the full lifecycle: VM creation, software installation, cluster initialization, CRIU configuration, testing, kubeconfig retrieval, and cluster teardown.
Overview
# Create the VMs (1 master + 1 worker)
clouder kubeadm vm-create --workers 1 k1
# Set up Kubernetes (containerd 2.x, CRIU, buildah, kubeadm, Flannel CNI, feature gates on all nodes)
clouder kubeadm setup k1
# Fetch the kubeconfig to ~/.clouder/kubeconfigs/kubeconfig-k1
clouder kubeadm get-config k1
# Verify
clouder kubectl k1 get nodes
clouder kubectl k1 get pods -A
# Enable ingress with a Traefik controller and cloud-specific load balancer setup
clouder kubeadm enable-ingress-traefik k1
clouder helm k1 ls -A
# Run the smoke test (Ingress + CRIU validation)
clouder kubeadm smoke-test k1
# Upgrade kubelet/kubeadm/kubectl on all nodes (e.g. after bumping K8S_VERSION)
clouder kubeadm upgrade-kubelet k1
# Disable ingress
clouder kubeadm disable-ingress-traefik k1
# Tear down the cluster and delete all cloud resources
clouder kubeadm vm-terminate k1
Commands
clouder kubeadm vm-create
Create VMs for a kubeadm cluster: 1 master + N worker nodes on the same subnet.
clouder kubeadm vm-create <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name (used as prefix for all VMs) |
| Option | Short | Description |
|---|---|---|
| --workers | -w | Number of worker nodes (default: 3) |
| --region | -r | Cloud region (e.g. eastus, us-east-1) |
| --resource-group | -g | Resource group (Azure only) |
| --master-size | | VM size for the master node (default: Standard_B4ms) |
| --node-size | | VM size for worker nodes (default: Standard_B4ms) |
| --os-disk-size | | OS disk size in GB (default: 100, min 30) |
| --admin-user | | Admin username (default: azureuser on Azure, ubuntu on AWS) |
| --image | | Image: Ubuntu2204, Ubuntu2404, Debian12 (Azure only, default: Ubuntu2204) |
What gets created:
| Resource | Name | Description |
|---|---|---|
| VNet | <name>-vnet | Virtual network with 10.0.0.0/16 address space |
| Subnet | <name>-subnet | Subnet with 10.0.0.0/24 range |
| NSG | <name>-nsg | Network Security Group with rules (see below) |
| Master VM | <name>-master | Control plane node |
| Worker VMs | <name>-node-1, <name>-node-2, ... | Worker nodes |
NSG rules:
| Rule | Port | Source | Purpose |
|---|---|---|---|
| AllowSSH | 22 | Any | SSH access |
| AllowK8sAPI | 6443 | Any | Kubernetes API server |
| AllowKubelet | 10250 | VNet (10.0.0.0/16) | Kubelet communication |
| AllowNodePorts | 30000-32767 | Any | Kubernetes NodePort services |
| AllowHTTP | 80 | Any | HTTP ingress traffic |
| AllowHTTPS | 443 | Any | HTTPS ingress traffic |
All VMs share the same VNet, subnet, and NSG — they can communicate with each other on private IPs while being individually accessible via public IPs.
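To inspect the generated rules after creation (assuming the Azure CLI is logged in; my-rg is a placeholder):
az network nsg rule list --resource-group my-rg --nsg-name my-cluster-nsg --output table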
Interactive prompts (same as clouder vm create):
- Resource group — pick existing or create new
- Region — only if creating a new resource group
- SSH key — pick existing or generate new
clouder kubeadm setup
Set up a kubeadm cluster with CRIU support on previously created VMs. This is the main automation command — it SSHes into each VM and runs all setup steps.
clouder kubeadm setup <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name (must match the vm-create name) |
| Option | Short | Description |
|---|---|---|
| --admin-user | -u | SSH username (default: ubuntu) |
| --key | -i | SSH key name from ~/.ssh/ |
| --k8s-version | | Kubernetes version (default: 1.32) |
What it does (6 steps):
| Step | Target | Description |
|---|---|---|
| 1. Prerequisites | All nodes | Disable swap, load kernel modules (overlay, br_netfilter), configure sysctl, install containerd 2.x (from Docker repo, with SystemdCgroup), install CRIU, install buildah, install kubeadm/kubelet/kubectl |
| 2. kubeadm init | Master | Initialize the control plane with --pod-network-cidr=10.244.0.0/16, enable ContainerCheckpoint feature gate |
| 3. Install CNI | Master | Install Flannel CNI for pod networking |
| 4. kubeadm join | Workers | Join all worker nodes to the cluster |
| 5. CRIU feature gates | All nodes | Enable ContainerCheckpoint feature gate on all kubelets (master + workers) |
| 6. Cloud storage and load balancer bootstrap | Cloud-specific | Azure: deploy cloud config, install Azure Disk CSI (managed-csi) and Azure File CSI (azure-nfs). AWS: install AWS EBS CSI with default gp3 StorageClass and install AWS Load Balancer Controller (instance profile preferred, static credentials fallback). |
Azure storage setup requires a service principal. If AZURE_TENANT_ID, AZURE_CLIENT_ID, and AZURE_CLIENT_SECRET are set, they are used. Otherwise, Clouder attempts to create a scoped SP automatically.
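For example, to supply your own service principal before running setup (placeholder values; the az command mirrors what clouder would otherwise do, though its automatically created SP may be scoped more narrowly):
export AZURE_TENANT_ID=<tenant-id>
export AZURE_CLIENT_ID=<app-id>
export AZURE_CLIENT_SECRET=<client-secret>
# Or create one scoped to the cluster's resource group:
az ad sp create-for-rbac --name my-cluster-sp --role Contributor \
  --scopes /subscriptions/<subscription-id>/resourceGroups/my-rg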
AWS storage setup prefers EC2 instance profiles on node VMs. If no instance profile is attached, Clouder falls back to the active AWS credentials in the current session.
On AWS, load balancer setup installs AWS Load Balancer Controller and expects node IAM permissions for ELB operations.
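For reference, step 1 boils down to the standard kubeadm node preparation. A sketch of the essentials (not the exact script clouder runs):
# Disable swap (required by the kubelet)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
# Kernel modules needed by containerd and the CNI
sudo modprobe overlay
sudo modprobe br_netfilter
# Make bridged traffic visible to iptables and enable IP forwarding
cat <<'EOF' | sudo tee /etc/sysctl.d/99-kubernetes.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system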
Example:
clouder kubeadm setup my-cluster
clouder kubeadm setup my-cluster --k8s-version 1.31 --key my-cluster-key
clouder kubeadm get-config
Fetch the kubeconfig from the master node and save it locally.
clouder kubeadm get-config <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name |
| Option | Short | Description |
|---|---|---|
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
The kubeconfig is saved to ~/.clouder/kubeconfigs/kubeconfig-<name> with 600 permissions. The internal API server IP is automatically replaced with the master's public IP for remote access.
Additionally, the kubelet client certificates (apiserver-kubelet-client.crt and .key) are fetched from the master and saved to ~/.clouder/kubeconfigs/<name>/. These are needed for CRIU checkpoint API calls via the kubelet.
Example:
clouder kubeadm get-config my-cluster
# Output:
# Kubeconfig saved to ~/.clouder/kubeconfigs/kubeconfig-my-cluster
# Kubelet client cert saved to ~/.clouder/kubeconfigs/my-cluster/apiserver-kubelet-client.crt
# Kubelet client key saved to ~/.clouder/kubeconfigs/my-cluster/apiserver-kubelet-client.key
#
# Usage:
# export KUBECONFIG=~/.clouder/kubeconfigs/kubeconfig-my-cluster
# export KUBELET_CLIENT_CERT=~/.clouder/kubeconfigs/my-cluster/apiserver-kubelet-client.crt
# export KUBELET_CLIENT_KEY=~/.clouder/kubeconfigs/my-cluster/apiserver-kubelet-client.key
# kubectl get nodes
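With those files in place you can call the kubelet checkpoint API directly. A minimal sketch (pod and container names are placeholders; the request must reach port 10250, which the NSG only opens inside the VNet, so run it from the master, for example):
CERTS=~/.clouder/kubeconfigs/my-cluster
curl -sk -X POST \
  --cert $CERTS/apiserver-kubelet-client.crt \
  --key $CERTS/apiserver-kubelet-client.key \
  "https://<node-ip>:10250/checkpoint/default/my-pod/my-container"
# On success, the checkpoint archive is written to /var/lib/kubelet/checkpoints/ on that node.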
clouder kubeadm info
Show cluster information and next steps. Displays the current state of a kubeadm cluster and lists useful commands for day-to-day operations as well as further setup steps.
clouder kubeadm info <name>
| Argument | Description |
|---|---|
| name (required) | Cluster name |
Example:
clouder kubeadm info my-cluster
clouder kubeadm scale
Scale the number of worker nodes in an existing kubeadm cluster. It compares the desired worker count with the current count, then:
- Scale up: creates new VMs, installs prerequisites, joins them to the cluster.
- Scale down: drains and deletes the highest-numbered worker nodes (see the drain sketch below).
clouder kubeadm scale <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name |
| Option | Short | Description |
|---|---|---|
| --workers (required) | -w | Desired number of worker nodes |
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
| --force | -f | Skip confirmation prompt |
Example:
# Scale up to 5 workers
clouder kubeadm scale my-cluster --workers 5
# Scale down to 1 worker (force, no confirmation)
clouder kubeadm scale my-cluster --workers 1 --force
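For reference, the scale-down drain is roughly equivalent to running the following per node before its VM is deleted (a sketch; the node name is a placeholder):
kubectl drain my-cluster-node-5 --ignore-daemonsets --delete-emptydir-data
kubectl delete node my-cluster-node-5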
clouder kubeadm vm-terminate
Terminate all VMs and networking for a kubeadm cluster. Deletes VMs, NICs, public IPs, OS disks, NSG, and VNet.
clouder kubeadm vm-terminate <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name |
| Option | Short | Description |
|---|---|---|
| --force | -f | Skip confirmation prompt |
| --delete-rg | | Also delete the resource group |
Deletion order:
| Step | Resources | Notes |
|---|---|---|
| 1 | VMs | <name>-master, <name>-node-* |
| 2 | NICs | Associated network interfaces |
| 3 | Public IPs | Associated public IP addresses |
| 4 | OS Disks | Managed disks created with VMs |
| 5 | Load Balancer | <name>-lb and <name>-lb-ip (if present) |
| 6 | NSG | <name>-nsg |
| 7 | VNet | <name>-vnet (includes subnet) |
| 8 | Resource Group | Only if --delete-rg is specified |
The local kubeconfig (~/.clouder/kubeconfigs/kubeconfig-<name>) is also removed if it exists.
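To verify that nothing was left behind after termination (assuming the Azure CLI is logged in; my-rg is a placeholder):
az resource list --resource-group my-rg --output table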
Example:
# Interactive confirmation
clouder kubeadm vm-terminate my-cluster
# Force delete without confirmation
clouder kubeadm vm-terminate my-cluster --force
# Also delete the resource group
clouder kubeadm vm-terminate my-cluster --force --delete-rg
clouder kubeadm smoke-test
Run an end-to-end smoke test with two phases: ingress load balancer validation and CRIU checkpoint/restore validation.
clouder kubeadm smoke-test <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name |
| Option | Short | Description |
|---|---|---|
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
| --cleanup / --no-cleanup | | Clean up test pods after the test (default: --cleanup) |
Section 1 — Ingress LB (4 steps, skipped if no ingress controller):
| Step | Description |
|---|---|
| 1 | Deploy test nginx Deployment + Service (clouder-smoke-web), show resources with kubectl get |
| 2 | Create an Ingress resource routing / to the test service, show with kubectl describe |
| 3 | Curl the Azure LB public IP from localhost and validate HTTP 200 (up to 5 retries, shows response headers and body) |
| 4 | If DATALAYER_RUN_URL is set, curl the DNS hostname and validate HTTP 200 (shows response headers and body). Skipped with a hint if the env var is not set. |
Section 1 is automatically skipped if no ingress controller (nginx or traefik) is detected. Run enable-ingress-nginx or enable-ingress-traefik first.
Section 2 — CRIU (8 steps):
| Step | Description |
|---|---|
| 1 | Deploy a busybox counter pod (clouder-smoke-test) |
| 2 | Wait for pod to be ready |
| 3 | Let the counter accumulate state for 15 seconds |
| 4 | Identify which worker node the pod is running on |
| 5 | Checkpoint via kubelet API using client certificates |
| 6 | Verify checkpoint archive exists on the worker node |
| 7 | Delete the original pod |
| 8 | Import checkpoint into containerd (via buildah), deploy restored pod, validate counter |
Section 2 requires containerd 2.0+ (installed by setup). On older containerd versions, the checkpoint API returns "method not implemented" and this section reports a warning instead of failing.
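Step 8 follows the general shape of the upstream forensic container checkpointing recipe: wrap the checkpoint tarball in an OCI image and hand it to the runtime. A sketch run on the worker node (the archive name and the annotation come from the upstream CRI-O-flavored walkthrough and are placeholders; clouder's containerd import may differ in detail):
newcontainer=$(sudo buildah from scratch)
sudo buildah add $newcontainer /var/lib/kubelet/checkpoints/checkpoint-<pod>_<ns>-<container>-<ts>.tar /
sudo buildah config --annotation=io.kubernetes.cri-o.annotations.checkpoint.name=<container> $newcontainer
sudo buildah commit $newcontainer checkpoint-restored:latest
sudo buildah push checkpoint-restored:latest oci-archive:/tmp/checkpoint-restored.tar
sudo ctr -n k8s.io images import /tmp/checkpoint-restored.tar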
Result interpretation:
| Counter after restore | Meaning |
|---|---|
| ≥ checkpoint value | CRIU restore: PASSED — process continued from checkpoint |
| < checkpoint value | Checkpoint created, but full CRIU restore requires containerd 2.0+ |
| Pod doesn't start | Checkpoint image format not compatible with current containerd |
Example:
clouder kubeadm smoke-test my-cluster
# Keep test pods for inspection
clouder kubeadm smoke-test my-cluster --no-cleanup
clouder kubeadm enable-ingress-nginx
Deploy an ingress-nginx controller and create an Azure Load Balancer with a public IP that forwards HTTP/HTTPS traffic to the cluster.
clouder kubeadm enable-ingress-nginx <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name |
| Option | Short | Description |
|---|---|---|
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
What it does (4 steps):
| Step | Description |
|---|---|
| 1 | Deploy ingress-nginx controller in NodePort mode on the cluster |
| 2 | Read the assigned NodePorts for HTTP and HTTPS |
| 3 | Create Azure Load Balancer (<name>-lb) with public IP, rules 80→HTTP NodePort, 443→HTTPS NodePort |
| 4 | Add all worker VM NICs to the LB backend pool |
Architecture:
Internet → Azure LB (public IP)
├── :80 → worker NICs → NodePort → ingress-nginx → Ingress rules → Services
└── :443 → worker NICs → NodePort → ingress-nginx → Ingress rules → Services
No Azure Cloud Controller Manager is needed — the LB is provisioned externally via Azure SDK.
Azure resources created:
| Resource | Name | Description |
|---|---|---|
| Public IP | <name>-lb-ip | Standard SKU, static allocation |
| Load Balancer | <name>-lb | Standard SKU with HTTP/HTTPS rules |
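For orientation, the LB is plain Azure resources with no Kubernetes integration; a rough az CLI equivalent (clouder uses the Azure SDK instead, health probes are omitted, and the NodePort 30080 is a placeholder):
az network public-ip create -g my-rg -n my-cluster-lb-ip --sku Standard --allocation-method Static
az network lb create -g my-rg -n my-cluster-lb --sku Standard \
  --public-ip-address my-cluster-lb-ip --backend-pool-name my-cluster-pool
az network lb rule create -g my-rg --lb-name my-cluster-lb -n http \
  --protocol Tcp --frontend-port 80 --backend-port 30080 --backend-pool-name my-cluster-pool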
Example:
clouder kubeadm enable-ingress-nginx my-cluster
# After enabling, create Ingress resources:
clouder kubectl my-cluster create ingress demo --rule='/*=my-svc:80'
clouder kubeadm disable-ingress-nginx
Remove the Azure Load Balancer and ingress-nginx controller from the cluster.
clouder kubeadm disable-ingress-nginx <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name |
| Option | Short | Description |
|---|---|---|
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
| --force | -f | Skip confirmation prompt |
What it deletes:
- Azure Load Balancer (<name>-lb)
- LB public IP (<name>-lb-ip)
- ingress-nginx namespace and all its resources
Example:
clouder kubeadm disable-ingress-nginx my-cluster
clouder kubeadm disable-ingress-nginx my-cluster --force
clouder kubeadm enable-ingress-traefik
Deploy a Traefik ingress controller (via Helm) and create an Azure Load Balancer with a public IP that forwards HTTP/HTTPS traffic to the cluster.
clouder kubeadm enable-ingress-traefik <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name |
| Option | Short | Description |
|---|---|---|
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
What it does (4 steps):
| Step | Description |
|---|---|
| 1 | Install Helm (if needed), deploy Traefik via Helm in NodePort mode |
| 2 | Read the assigned NodePorts for HTTP (web) and HTTPS (websecure) |
| 3 | Create Azure Load Balancer (<name>-lb) with public IP, rules 80→HTTP NodePort, 443→HTTPS NodePort |
| 4 | Add all worker VM NICs to the LB backend pool |
Architecture:
Internet → Azure LB (public IP)
├── :80 → worker NICs → NodePort → Traefik → IngressRoute/Ingress → Services
└── :443 → worker NICs → NodePort → Traefik → IngressRoute/Ingress → Services
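Step 1 is roughly the standard Traefik chart install pinned to NodePort mode (a sketch; the values clouder passes may differ):
helm repo add traefik https://traefik.github.io/charts
helm repo update
helm install traefik traefik/traefik --namespace traefik --create-namespace \
  --set service.type=NodePort
# Read back the HTTP NodePort that the LB rule must target
kubectl -n traefik get svc traefik -o jsonpath='{.spec.ports[?(@.name=="web")].nodePort}'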
Example:
clouder kubeadm enable-ingress-traefik my-cluster
# After enabling, create Ingress resources:
clouder kubectl my-cluster create ingress demo --rule='/*=my-svc:80'
clouder kubeadm disable-ingress-traefik
Remove the Azure Load Balancer and Traefik ingress controller from the cluster.
clouder kubeadm disable-ingress-traefik <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name |
| Option | Short | Description |
|---|---|---|
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
| --force | -f | Skip confirmation prompt |
What it deletes:
- Azure Load Balancer (<name>-lb)
- LB public IP (<name>-lb-ip)
- Traefik Helm release and traefik namespace
Example:
clouder kubeadm disable-ingress-traefik my-cluster
clouder kubeadm disable-ingress-traefik my-cluster --force
clouder kubeadm upgrade-kubelet
Upgrade kubelet, kubeadm, and kubectl on all nodes of an existing cluster to the target Kubernetes version (K8S_VERSION, currently 1.32). This is needed when the cluster was originally provisioned with an older Kubernetes version and requires features available in newer kubelet releases (e.g. the checkpoint timeout query parameter added in Kubernetes 1.30+).
clouder kubeadm upgrade-kubelet <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name (must match the vm-create name) |
| Option | Short | Description |
|---|---|---|
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
What it does (per node):
| Step | Description |
|---|---|
| 1 | Update the Kubernetes apt repository to the target version |
| 2 | Unhold the kubelet, kubeadm, and kubectl packages |
| 3 | Upgrade all three packages to the latest patch release of v$K8S_VERSION |
| 4 | Re-hold the packages to prevent accidental upgrades |
| 5 | Run systemctl daemon-reload && systemctl restart kubelet |
Nodes are upgraded sequentially (master first, then workers). Pods are NOT drained — for zero-downtime upgrades, drain nodes manually before running this command.
This command upgrades the kubelet binary and restarts it. It does not run kubeadm upgrade apply (control plane component upgrade). For a full Kubernetes version upgrade, run kubeadm upgrade apply on the master node separately.
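Per node, this corresponds roughly to the standard pkgs.k8s.io flow (a sketch assuming the target version is 1.32 and the repo signing key is already installed):
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.32/deb/ /' \
  | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-mark unhold kubelet kubeadm kubectl
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
sudo systemctl daemon-reload && sudo systemctl restart kubelet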
Example:
# Upgrade all nodes to the target K8s version
clouder kubeadm upgrade-kubelet my-cluster
# With a specific SSH key
clouder kubeadm upgrade-kubelet my-cluster --key my-cluster-key
Typical Workflow
Create a Cluster
Three commands are needed to go from zero to a running Kubernetes cluster with CRIU support:
# 1. Create the VMs (1 master + 3 workers on Azure)
clouder kubeadm vm-create my-cluster
# 2. Set up Kubernetes on all nodes
clouder kubeadm setup my-cluster
# 3. Fetch the kubeconfig
clouder kubeadm get-config my-cluster
vm-create provisions 4 VMs (Standard_B4ms, 100 GB OS disks) in Azure with a VNet, subnet, and NSG.
setup installs containerd 2.x, CRIU, buildah, kubeadm/kubelet/kubectl, initializes the control plane with Flannel CNI, joins workers, enables the ContainerCheckpoint feature gate on every node, and installs the Azure Disk CSI and Azure File CSI drivers for dynamic persistent volume provisioning.
get-config downloads the kubeconfig to ~/.clouder/kubeconfigs/kubeconfig-my-cluster.
Use the Cluster
# Run kubectl via the persisted kubeconfig
clouder kubectl my-cluster get nodes
clouder kubectl my-cluster get pods -A
clouder kubectl my-cluster apply -f deployment.yaml
Add an Ingress Controller
Pick either nginx or traefik. Both create an Azure Load Balancer with a public IP:
# Option A: ingress-nginx
clouder kubeadm enable-ingress-nginx my-cluster
# Option B: traefik
clouder kubeadm enable-ingress-traefik my-cluster