clouder kubeadm
Provision, set up, and manage kubeadm Kubernetes clusters with CRIU support. Handles the full lifecycle: VM creation, software installation, cluster initialization, CRIU configuration, testing, kubeconfig retrieval, and cluster teardown.
Overview
# Create the VMs (1 master + 1 worker)
clouder kubeadm vm-create --workers 1 k1
# Set up Kubernetes (containerd 2.x, CRIU, buildah, kubeadm, Flannel CNI, feature gates on all nodes)
clouder kubeadm setup k1
# Fetch the kubeconfig to ~/.clouder/kubeconfigs/kubeconfig-k1
clouder kubeadm get-config k1
# Verify
clouder kubectl k1 get nodes
clouder kubectl k1 get pods -A
# Enable ingress with Traefik controller and Azure Load Balancer
clouder kubeadm enable-ingress-traefik k1
clouder helm k1 ls -A
# Run the smoke test (Ingress + CRIU validation)
clouder kubeadm smoke-test k1
# Disable ingress
clouder kubeadm disable-ingress-traefik k1
# Tear down the cluster and delete all VMs and Azure resources
clouder kubeadm vm-terminate k1
Commands
clouder kubeadm vm-create
Create VMs for a kubeadm cluster: 1 master + N worker nodes on the same subnet.
clouder kubeadm vm-create <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name (used as prefix for all VMs) |

| Option | Short | Description |
|---|---|---|
| --workers | -w | Number of worker nodes (default: 3) |
| --region | -r | Azure region |
| --resource-group | -g | Resource group |
| --master-size | | VM size for the master node (default: Standard_B4ms) |
| --node-size | | VM size for worker nodes (default: Standard_B4ms) |
| --os-disk-size | | OS disk size in GB (default: 100, min 30) |
| --admin-user | | Admin username (default: azureuser) |
| --image | | Image: Ubuntu2204, Ubuntu2404, Debian12 (default: Ubuntu2204) |
What gets created:
| Resource | Name | Description |
|---|---|---|
| VNet | <name>-vnet | Virtual network with 10.0.0.0/16 address space |
| Subnet | <name>-subnet | Subnet with 10.0.0.0/24 range |
| NSG | <name>-nsg | Network Security Group with rules (see below) |
| Master VM | <name>-master | Control plane node |
| Worker VMs | <name>-node-1, <name>-node-2, ... | Worker nodes |
NSG rules:
| Rule | Port | Source | Purpose |
|---|---|---|---|
| AllowSSH | 22 | Any | SSH access |
| AllowK8sAPI | 6443 | Any | Kubernetes API server |
| AllowKubelet | 10250 | VNet (10.0.0.0/16) | Kubelet communication |
| AllowNodePorts | 30000-32767 | Any | Kubernetes NodePort services |
| AllowHTTP | 80 | Any | HTTP ingress traffic |
| AllowHTTPS | 443 | Any | HTTPS ingress traffic |
All VMs share the same VNet, subnet, and NSG — they can communicate with each other on private IPs while being individually accessible via public IPs.
Interactive prompts (same as clouder vm create):
- Resource group — pick existing or create new
- Region — only if creating a new resource group
- SSH key — pick existing or generate new
clouder kubeadm setup
Set up a kubeadm cluster with CRIU support on previously created VMs. This is the main automation command — it SSHes into each VM and runs all setup steps.
clouder kubeadm setup <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name (must match the vm-create name) |

| Option | Short | Description |
|---|---|---|
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
| --k8s-version | | Kubernetes version (default: 1.32) |
What it does (7 steps):
| Step | Target | Description |
|---|---|---|
| 1. Prerequisites | All nodes | Disable swap, load kernel modules (overlay, br_netfilter), configure sysctl, install containerd 2.x (from Docker repo, with SystemdCgroup), install CRIU, install buildah, install kubeadm/kubelet/kubectl |
| 2. kubeadm init | Master | Initialize the control plane with --pod-network-cidr=10.244.0.0/16, enable ContainerCheckpoint feature gate |
| 3. Install CNI | Master | Install Flannel CNI for pod networking |
| 4. kubeadm join | Workers | Join all worker nodes to the cluster |
| 5. CRIU feature gates | All nodes | Enable ContainerCheckpoint feature gate on all kubelets (master + workers) |
| 6. Azure Disk CSI | All nodes + Master | Deploy /etc/kubernetes/azure.json cloud config to all nodes, create azure-cloud-provider secret, install Azure Disk CSI driver v1.30.3, and create default managed-csi StorageClass (StandardSSD_LRS) |
| 7. Azure File CSI | Master | Install Azure File CSI driver v1.30.6 and create azure-nfs StorageClass with subscription, resource group, and location baked in |
Step 6 requires an Azure service principal. If AZURE_TENANT_ID, AZURE_CLIENT_ID, and AZURE_CLIENT_SECRET environment variables are set, they will be used. Otherwise, a new SP scoped to the cluster's resource group is auto-created via az ad sp create-for-rbac. If SP creation fails, the step is skipped with a warning — the cluster is still usable but dynamic persistent volume provisioning won't work until the CSI driver is installed manually.
Step 7 installs the Azure File CSI driver and creates the azure-nfs StorageClass with the Azure subscription, resource group, and location baked in from the cluster metadata. This is required because kubeadm nodes have no Azure instance metadata (unlike AKS). The plane up datalayer-shared-filesystem command relies on this StorageClass already existing.
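To make setup reuse an existing service principal instead of creating one, export the three variables beforehand — a minimal sketch with placeholder values:

```shell
# Placeholder credentials -- substitute the values of an SP that has
# Contributor access to the cluster's resource group.
export AZURE_TENANT_ID="<tenant-id>"
export AZURE_CLIENT_ID="<app-id>"
export AZURE_CLIENT_SECRET="<client-secret>"

# setup picks these up instead of calling `az ad sp create-for-rbac`:
# clouder kubeadm setup my-cluster
```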
Example:
clouder kubeadm setup my-cluster
clouder kubeadm setup my-cluster --k8s-version 1.31 --key my-cluster-key
clouder kubeadm get-config
Fetch the kubeconfig from the master node and save it locally.
clouder kubeadm get-config <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name |

| Option | Short | Description |
|---|---|---|
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
The kubeconfig is saved to ~/.clouder/kubeconfigs/kubeconfig-<name> with 600 permissions. The internal API server IP is automatically replaced with the master's public IP for remote access.
Example:
clouder kubeadm get-config my-cluster
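Because the saved file is a standard kubeconfig, plain kubectl can use it directly — shown here for a hypothetical cluster named my-cluster:

```shell
# Point kubectl at the fetched kubeconfig for the rest of the session
export KUBECONFIG="$HOME/.clouder/kubeconfigs/kubeconfig-my-cluster"
# kubectl get nodes    # now equivalent to: clouder kubectl my-cluster get nodes
```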
clouder kubeadm info
Show cluster information and next steps. Displays the current state of a kubeadm cluster and lists useful commands for day-to-day operations as well as further setup steps.
clouder kubeadm info <name>
| Argument | Description |
|---|---|
| name (required) | Cluster name |
Example:
clouder kubeadm info my-cluster
clouder kubeadm scale
Scale the number of worker nodes in an existing kubeadm cluster. Compares the desired worker count with the current count, then:
- Scale up: creates new VMs, installs prerequisites, joins them to the cluster.
- Scale down: drains and deletes the highest-numbered worker nodes.
clouder kubeadm scale <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name |

| Option | Short | Description |
|---|---|---|
| --workers (required) | -w | Desired number of worker nodes |
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
| --force | -f | Skip confirmation prompt |
Example:
# Scale up to 5 workers
clouder kubeadm scale my-cluster --workers 5
# Scale down to 1 worker (force, no confirmation)
clouder kubeadm scale my-cluster --workers 1 --force
clouder kubeadm vm-terminate
Terminate all VMs and networking for a kubeadm cluster. Deletes VMs, NICs, public IPs, OS disks, NSG, and VNet.
clouder kubeadm vm-terminate <name> [OPTIONS]
| Argument | Description |
|---|---|
name (required) | Cluster name |
| Option | Short | Description |
|---|---|---|
--force | -f | Skip confirmation prompt |
--delete-rg | Also delete the resource group |
Deletion order:
| Step | Resources | Notes |
|---|---|---|
| 1 | VMs | <name>-master, <name>-node-* |
| 2 | NICs | Associated network interfaces |
| 3 | Public IPs | Associated public IP addresses |
| 4 | OS Disks | Managed disks created with VMs |
| 5 | Load Balancer | <name>-lb and <name>-lb-ip (if present) |
| 6 | NSG | <name>-nsg |
| 7 | VNet | <name>-vnet (includes subnet) |
| 8 | Resource Group | Only if --delete-rg is specified |
The local kubeconfig (~/.clouder/kubeconfigs/kubeconfig-<name>) is also removed if it exists.
Example:
# Interactive confirmation
clouder kubeadm vm-terminate my-cluster
# Force delete without confirmation
clouder kubeadm vm-terminate my-cluster --force
# Also delete the resource group
clouder kubeadm vm-terminate my-cluster --force --delete-rg
clouder kubeadm smoke-test
Run an end-to-end smoke test with two phases: ingress load balancer validation followed by CRIU checkpoint/restore validation.
clouder kubeadm smoke-test <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name |

| Option | Short | Description |
|---|---|---|
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
| --cleanup / --no-cleanup | | Clean up test pods after the test (default: --cleanup) |
Section 1 — Ingress LB (4 steps, skipped if no ingress controller):
| Step | Description |
|---|---|
| 1 | Deploy test nginx Deployment + Service (clouder-smoke-web), show resources with kubectl get |
| 2 | Create an Ingress resource routing / to the test service, show with kubectl describe |
| 3 | Curl the Azure LB public IP from localhost and validate HTTP 200 (up to 5 retries, shows response headers and body) |
| 4 | If DATALAYER_RUN_URL is set, curl the DNS hostname and validate HTTP 200 (shows response headers and body). Skipped with a hint if the env var is not set. |
Section 1 is automatically skipped if no ingress controller (nginx or traefik) is detected. Run enable-ingress-nginx or enable-ingress-traefik first.
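To exercise step 4 of Section 1, set DATALAYER_RUN_URL before running the test — the hostname below is illustrative:

```shell
# DNS name that resolves to the cluster's LB public IP (placeholder value)
export DATALAYER_RUN_URL="https://my-cluster.example.com"
# clouder kubeadm smoke-test my-cluster
```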
Section 2 — CRIU (8 steps):
| Step | Description |
|---|---|
| 1 | Deploy a busybox counter pod (clouder-smoke-test) |
| 2 | Wait for pod to be ready |
| 3 | Let the counter accumulate state for 15 seconds |
| 4 | Identify which worker node the pod is running on |
| 5 | Checkpoint via kubelet API using client certificates |
| 6 | Verify checkpoint archive exists on the worker node |
| 7 | Delete the original pod |
| 8 | Import checkpoint into containerd (via buildah), deploy restored pod, validate counter |
Section 2 requires containerd 2.0+ (installed by setup). On older containerd versions, the checkpoint API returns "method not implemented" and this section reports a warning instead of failing.
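Step 5 uses the kubelet checkpoint endpoint, whose path has the shape /checkpoint/{namespace}/{pod}/{container}. A sketch of the call the test makes (node IP, namespace, and container name are illustrative):

```shell
NODE_IP="10.0.0.5"       # private IP of the worker running the pod (placeholder)
NAMESPACE="default"
POD="clouder-smoke-test"
CONTAINER="counter"      # hypothetical container name
URL="https://${NODE_IP}:10250/checkpoint/${NAMESPACE}/${POD}/${CONTAINER}"
echo "$URL"

# On the master, the kubeadm-generated kubelet client certificate authenticates:
# sudo curl -k -X POST \
#   --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
#   --key  /etc/kubernetes/pki/apiserver-kubelet-client.key \
#   "$URL"
# The checkpoint archive lands under /var/lib/kubelet/checkpoints/ on that node.
```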
Result interpretation:
| Counter after restore | Meaning |
|---|---|
| ≥ checkpoint value | CRIU restore: PASSED — process continued from checkpoint |
| < checkpoint value | Checkpoint created, but full CRIU restore requires containerd 2.0+ |
| Pod doesn't start | Checkpoint image format not compatible with current containerd |
Example:
clouder kubeadm smoke-test my-cluster
# Keep test pods for inspection
clouder kubeadm smoke-test my-cluster --no-cleanup
clouder kubeadm enable-ingress-nginx
Deploy an ingress-nginx controller and create an Azure Load Balancer with a public IP that forwards HTTP/HTTPS traffic to the cluster.
clouder kubeadm enable-ingress-nginx <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name |

| Option | Short | Description |
|---|---|---|
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
What it does (4 steps):
| Step | Description |
|---|---|
| 1 | Deploy ingress-nginx controller in NodePort mode on the cluster |
| 2 | Read the assigned NodePorts for HTTP and HTTPS |
| 3 | Create Azure Load Balancer (<name>-lb) with public IP, rules 80→HTTP NodePort, 443→HTTPS NodePort |
| 4 | Add all worker VM NICs to the LB backend pool |
Architecture:
Internet → Azure LB (public IP)
├── :80 → worker NICs → NodePort → ingress-nginx → Ingress rules → Services
└── :443 → worker NICs → NodePort → ingress-nginx → Ingress rules → Services
No Azure Cloud Controller Manager is needed — the LB is provisioned externally via Azure SDK.
Azure resources created:
| Resource | Name | Description |
|---|---|---|
| Public IP | <name>-lb-ip | Standard SKU, static allocation |
| Load Balancer | <name>-lb | Standard SKU with HTTP/HTTPS rules |
Example:
clouder kubeadm enable-ingress-nginx my-cluster
# After enabling, create Ingress resources:
clouder kubectl my-cluster create ingress demo --rule='/*=my-svc:80'
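The create ingress one-liner can also be written as a manifest — a minimal sketch assuming a backend Service named my-svc listening on port 80:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo
spec:
  ingressClassName: nginx        # matches the controller deployed above
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-svc     # hypothetical backend Service
                port:
                  number: 80
```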
clouder kubeadm disable-ingress-nginx
Remove the Azure Load Balancer and ingress-nginx controller from the cluster.
clouder kubeadm disable-ingress-nginx <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name |

| Option | Short | Description |
|---|---|---|
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
| --force | -f | Skip confirmation prompt |
What it deletes:
- Azure Load Balancer (<name>-lb)
- LB public IP (<name>-lb-ip)
- ingress-nginx namespace and all its resources
Example:
clouder kubeadm disable-ingress-nginx my-cluster
clouder kubeadm disable-ingress-nginx my-cluster --force
clouder kubeadm enable-ingress-traefik
Deploy a Traefik ingress controller (via Helm) and create an Azure Load Balancer with a public IP that forwards HTTP/HTTPS traffic to the cluster.
clouder kubeadm enable-ingress-traefik <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name |

| Option | Short | Description |
|---|---|---|
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
What it does (4 steps):
| Step | Description |
|---|---|
| 1 | Install Helm (if needed), deploy Traefik via Helm in NodePort mode |
| 2 | Read the assigned NodePorts for HTTP (web) and HTTPS (websecure) |
| 3 | Create Azure Load Balancer (<name>-lb) with public IP, rules 80→HTTP NodePort, 443→HTTPS NodePort |
| 4 | Add all worker VM NICs to the LB backend pool |
Architecture:
Internet → Azure LB (public IP)
├── :80 → worker NICs → NodePort → Traefik → IngressRoute/Ingress → Services
└── :443 → worker NICs → NodePort → Traefik → IngressRoute/Ingress → Services
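Traefik also accepts its own IngressRoute CRD in place of a standard Ingress. A minimal sketch assuming a backend Service named my-svc on port 80 (the apiVersion may be traefik.containo.us/v1alpha1 on older chart versions):

```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: demo
spec:
  entryPoints:
    - web                  # Traefik's HTTP entry point, exposed via the NodePort
  routes:
    - match: PathPrefix(`/`)
      kind: Rule
      services:
        - name: my-svc     # hypothetical backend Service
          port: 80
```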
Example:
clouder kubeadm enable-ingress-traefik my-cluster
# After enabling, create Ingress resources:
clouder kubectl my-cluster create ingress demo --rule='/*=my-svc:80'
clouder kubeadm disable-ingress-traefik
Remove the Azure Load Balancer and Traefik ingress controller from the cluster.
clouder kubeadm disable-ingress-traefik <name> [OPTIONS]
| Argument | Description |
|---|---|
| name (required) | Cluster name |

| Option | Short | Description |
|---|---|---|
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
| --force | -f | Skip confirmation prompt |
What it deletes:
- Azure Load Balancer (<name>-lb)
- LB public IP (<name>-lb-ip)
- Traefik Helm release and traefik namespace
Example:
clouder kubeadm disable-ingress-traefik my-cluster
clouder kubeadm disable-ingress-traefik my-cluster --force
Typical Workflow
Create a Cluster
Three commands are needed to go from zero to a running Kubernetes cluster with CRIU support:
# 1. Create the VMs (1 master + 3 workers on Azure)
clouder kubeadm vm-create my-cluster
# 2. Set up Kubernetes on all nodes
clouder kubeadm setup my-cluster
# 3. Fetch the kubeconfig
clouder kubeadm get-config my-cluster
vm-create provisions 4 VMs (Standard_B4ms, 100 GB OS disks) in Azure with a VNet, subnet, and NSG.
setup installs containerd 2.x, CRIU, buildah, kubeadm/kubelet/kubectl, initializes the control plane with Flannel CNI, joins workers, enables the ContainerCheckpoint feature gate on every node, and installs the Azure Disk CSI and Azure File CSI drivers for dynamic persistent volume provisioning.
get-config downloads the kubeconfig to ~/.clouder/kubeconfigs/kubeconfig-my-cluster.
Use the Cluster
# Run kubectl via the persisted kubeconfig
clouder kubectl my-cluster get nodes
clouder kubectl my-cluster get pods -A
clouder kubectl my-cluster apply -f deployment.yaml
Add an Ingress Controller
Pick one of nginx or traefik. Both create an Azure Load Balancer with a public IP:
# Option A: ingress-nginx
clouder kubeadm enable-ingress-nginx my-cluster
# Option B: traefik
clouder kubeadm enable-ingress-traefik my-cluster
Validate with Smoke Test
The smoke test validates both ingress routing and CRIU checkpoint/restore end-to-end:
clouder kubeadm smoke-test my-cluster
Tear Down
# Remove the ingress controller and load balancer first
clouder kubeadm disable-ingress-nginx my-cluster
# Delete all VMs and Azure resources
clouder kubeadm vm-terminate my-cluster
Related: clouder kubectl
Run kubectl commands using the persisted kubeconfig for a cluster.
clouder kubectl <name> <kubectl-args...>
| Argument | Description |
|---|---|
| name (required) | Cluster name (kubeconfig must exist in ~/.clouder/kubeconfigs/) |
| kubectl-args | Any kubectl arguments (e.g. get nodes, apply -f ...) |
Example:
clouder kubectl my-cluster get nodes
clouder kubectl my-cluster get pods -A
clouder kubectl my-cluster apply -f deployment.yaml
clouder kubectl my-cluster logs my-pod
This is equivalent to running:
kubectl --kubeconfig=~/.clouder/kubeconfigs/kubeconfig-my-cluster get nodes
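For longer sessions, the same thing can be shortened with a shell alias (the name k1 is arbitrary):

```shell
# One alias per cluster; points plain kubectl at the persisted kubeconfig
alias k1='kubectl --kubeconfig="$HOME/.clouder/kubeconfigs/kubeconfig-my-cluster"'
# k1 get pods -A
```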
Related: clouder helm
Run helm commands using the persisted kubeconfig for a cluster.
clouder helm <name> <helm-args...>
| Argument | Description |
|---|---|
| name (required) | Cluster name (kubeconfig must exist in ~/.clouder/kubeconfigs/) |
| helm-args | Any helm arguments (e.g. list -A, install ...) |
Example:
clouder helm my-cluster list -A
clouder helm my-cluster install my-release my-chart
clouder helm my-cluster upgrade my-release my-chart
clouder helm my-cluster uninstall my-release
This is equivalent to running:
helm --kubeconfig=~/.clouder/kubeconfigs/kubeconfig-my-cluster list -A
Networking
The setup command installs Flannel as the CNI with --pod-network-cidr=10.244.0.0/16. Flannel provides VXLAN-based overlay networking that works reliably on Azure VNets where all nodes share the same subnet.
Calico CNI code is preserved in the codebase (commented out in kubeadm.py) for future use. Calico's default VXLANCrossSubnet mode uses direct routing when nodes are on the same subnet, which requires IP forwarding enabled on Azure NICs. Flannel's VXLAN overlay avoids this issue entirely.
CRIU Support
The setup command configures every node for CRIU checkpoint/restore:
- containerd 2.x from Docker's official apt repository (supports the CRI CheckpointContainer method)
- SystemdCgroup enabled in containerd config
- CRIU installed and verified (criu check)
- buildah installed (converts checkpoint archives into OCI images for restore)
- ContainerCheckpoint feature gate enabled on all kubelets
- Kubelet API accessible on port 10250 (NSG rule)
CRIU checkpoint requires containerd 2.0+. The Ubuntu apt containerd package provides 1.7.x which does NOT support checkpointing. Clouder's setup installs containerd.io from Docker's repo to get 2.x.
If your cluster was created with an older version of Clouder (containerd 1.7.x), you must recreate it:
clouder kubeadm vm-terminate my-cluster
clouder kubeadm vm-create my-cluster
clouder kubeadm setup my-cluster
An in-place upgrade from containerd 1.7 to 2.x is not supported.
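To check whether a node's containerd is checkpoint-capable, compare its major version against 2. Sketched here on a sample version string; on a real node, run `containerd --version` over SSH:

```shell
# Sample output of `containerd --version` on a node (illustrative value)
VERSION_LINE="containerd containerd.io 2.0.1 abcdef"
VERSION="${VERSION_LINE#containerd containerd.io }"    # strip the tool prefix
MAJOR="${VERSION%%.*}"                                 # major version number
if [ "$MAJOR" -ge 2 ]; then
  echo "checkpoint-capable"
else
  echo "needs recreation (containerd ${VERSION%% *})"
fi
```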
Storage
The setup command installs the Azure Disk CSI driver and the Azure File CSI driver for dynamic persistent volume provisioning. The Disk CSI driver lets pods request PersistentVolumeClaim resources backed by Azure Managed Disks. The File CSI driver enables NFS-based shared filesystem provisioning; setup creates the azure-nfs StorageClass with the cluster's Azure subscription, resource group, and location baked in (required on kubeadm, since these cannot be auto-detected from instance metadata as on AKS).
What gets configured:
| Component | Description |
|---|---|
| /etc/kubernetes/azure.json | Azure cloud-provider config deployed to all nodes |
| azure-cloud-provider secret | Kubernetes secret in kube-system for the CSI controller |
| Azure Disk CSI driver v1.30.3 | Controller + node DaemonSet with snapshot support |
| Azure File CSI driver v1.30.6 | Controller + node DaemonSet for NFS file shares |
| managed-csi StorageClass | Default StorageClass using StandardSSD_LRS, WaitForFirstConsumer binding, volume expansion enabled |
| azure-nfs StorageClass | NFS StorageClass for shared filesystem (file.csi.azure.com, Premium_LRS) with Azure params |
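With the default managed-csi StorageClass in place, a pod can claim an Azure Managed Disk through an ordinary PVC — a minimal sketch (claim name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-disk
spec:
  accessModes:
    - ReadWriteOnce            # Azure Managed Disks attach to one node at a time
  storageClassName: managed-csi
  resources:
    requests:
      storage: 10Gi
```

Because of WaitForFirstConsumer binding, the disk is provisioned only once a pod that uses the claim is scheduled.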
Service principal:
The CSI driver requires an Azure service principal with Contributor access to the cluster's resource group. Clouder will:
- Use existing AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET environment variables if set.
- Otherwise, auto-create a new SP (clouder-<name>-csi) scoped to the resource group via az ad sp create-for-rbac.
- Skip the step with a warning if credentials cannot be obtained.
Manual installation (if Steps 6-7 were skipped):
# 1. Create a service principal
az ad sp create-for-rbac --name clouder-<name>-csi \
--role Contributor \
--scopes /subscriptions/<sub-id>/resourceGroups/<rg>
# 2. Create /etc/kubernetes/azure.json on every node (see Azure Disk CSI docs)
# 3. Install the CSI drivers
curl -skSL https://raw.githubusercontent.com/kubernetes-sigs/azuredisk-csi-driver/v1.30.3/deploy/install-driver.sh | bash -s v1.30.3 snapshot --
curl -skSL https://raw.githubusercontent.com/kubernetes-sigs/azurefile-csi-driver/v1.30.6/deploy/install-driver.sh | bash -s v1.30.6 --
# 4. Create the StorageClasses
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: managed-csi
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: disk.csi.azure.com
parameters:
skuName: StandardSSD_LRS
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: azure-nfs
provisioner: file.csi.azure.com
allowVolumeExpansion: true
parameters:
shareName: shared-storage
skuName: Premium_LRS
protocol: nfs
subscriptionID: "<sub-id>"
resourceGroup: "<rg>"
location: "<region>"
EOF
Use smoke-test to validate the full checkpoint/restore pipeline and ingress load balancer.
See the CRIU documentation for checkpoint and restore workflows.
All VMs are on the same subnet (10.0.0.0/24), so kubeadm nodes discover each other using private IPs. The setup command handles this automatically.
Currently only supported for Azure. OVH support is planned.