clouder kubeadm

Provision, set up, and manage kubeadm Kubernetes clusters with CRIU support. Handles the full lifecycle: VM creation, software installation, cluster initialization, CRIU configuration, testing, kubeconfig retrieval, and cluster teardown.

Overview

# Create the VMs (1 master + 1 worker)
clouder kubeadm vm-create --workers 1 k1

# Set up Kubernetes (containerd 2.x, CRIU, buildah, kubeadm, Flannel CNI, feature gates on all nodes)
clouder kubeadm setup k1

# Fetch the kubeconfig to ~/.clouder/kubeconfigs/kubeconfig-k1
clouder kubeadm get-config k1

# Verify
clouder kubectl k1 get nodes
clouder kubectl k1 get pods -A

# Enable ingress with Traefik controller and Azure Load Balancer
clouder kubeadm enable-ingress-traefik k1
clouder helm k1 ls -A

# Run the smoke test (Ingress + CRIU validation)
clouder kubeadm smoke-test k1

# Disable ingress
clouder kubeadm disable-ingress-traefik k1

# Tear down the cluster and delete all VMs and Azure resources
clouder kubeadm vm-terminate k1

Commands

clouder kubeadm vm-create

Create VMs for a kubeadm cluster: 1 master + N worker nodes on the same subnet.

clouder kubeadm vm-create <name> [OPTIONS]
Argument         Description
name (required)  Cluster name (used as prefix for all VMs)

Option            Short  Description
--workers         -w     Number of worker nodes (default: 3)
--region          -r     Azure region
--resource-group  -g     Resource group
--master-size            VM size for the master node (default: Standard_B4ms)
--node-size              VM size for worker nodes (default: Standard_B4ms)
--os-disk-size           OS disk size in GB (default: 100, min 30)
--admin-user             Admin username (default: azureuser)
--image                  Image: Ubuntu2204, Ubuntu2404, Debian12 (default: Ubuntu2204)

What gets created:

Resource    Name                               Description
VNet        <name>-vnet                        Virtual network with 10.0.0.0/16 address space
Subnet      <name>-subnet                      Subnet with 10.0.0.0/24 range
NSG         <name>-nsg                         Network Security Group with rules (see below)
Master VM   <name>-master                      Control plane node
Worker VMs  <name>-node-1, <name>-node-2, ...  Worker nodes

NSG rules:

Rule            Port         Source              Purpose
AllowSSH        22           Any                 SSH access
AllowK8sAPI     6443         Any                 Kubernetes API server
AllowKubelet    10250        VNet (10.0.0.0/16)  Kubelet communication
AllowNodePorts  30000-32767  Any                 Kubernetes NodePort services
AllowHTTP       80           Any                 HTTP ingress traffic
AllowHTTPS      443          Any                 HTTPS ingress traffic

All VMs share the same VNet, subnet, and NSG — they can communicate with each other on private IPs while being individually accessible via public IPs.

Interactive prompts (same as clouder vm create):

  1. Resource group — pick existing or create new
  2. Region — only if creating a new resource group
  3. SSH key — pick existing or generate new

clouder kubeadm setup

Set up a kubeadm cluster with CRIU support on previously created VMs. This is the main automation command — it SSHes into each VM and runs all setup steps.

clouder kubeadm setup <name> [OPTIONS]
Argument         Description
name (required)  Cluster name (must match the vm-create name)

Option         Short  Description
--admin-user   -u     SSH username (default: azureuser)
--key          -i     SSH key name from ~/.ssh/
--k8s-version         Kubernetes version (default: 1.32)

What it does (7 steps):

  1. Prerequisites (all nodes): disable swap, load kernel modules (overlay, br_netfilter), configure sysctl, install containerd 2.x (from Docker repo, with SystemdCgroup), install CRIU, install buildah, install kubeadm/kubelet/kubectl
  2. kubeadm init (master): initialize the control plane with --pod-network-cidr=10.244.0.0/16, enable the ContainerCheckpoint feature gate
  3. Install CNI (master): install Flannel CNI for pod networking
  4. kubeadm join (workers): join all worker nodes to the cluster
  5. CRIU feature gates (all nodes): enable the ContainerCheckpoint feature gate on all kubelets (master + workers)
  6. Azure Disk CSI (all nodes + master): deploy the /etc/kubernetes/azure.json cloud config to all nodes, create the azure-cloud-provider secret, install Azure Disk CSI driver v1.30.3, and create the default managed-csi StorageClass (StandardSSD_LRS)
  7. Azure File CSI (master): install Azure File CSI driver v1.30.6 and create the azure-nfs StorageClass with subscription, resource group, and location baked in
note

Step 6 requires an Azure service principal. If AZURE_TENANT_ID, AZURE_CLIENT_ID, and AZURE_CLIENT_SECRET environment variables are set, they will be used. Otherwise, a new SP scoped to the cluster's resource group is auto-created via az ad sp create-for-rbac. If SP creation fails, the step is skipped with a warning — the cluster is still usable but dynamic persistent volume provisioning won't work until the CSI driver is installed manually.

Step 7 installs the Azure File CSI driver and creates the azure-nfs StorageClass with the Azure subscription, resource group, and location baked in from the cluster metadata. This is required because kubeadm nodes have no Azure instance metadata (unlike AKS). The plane up datalayer-shared-filesystem command relies on this StorageClass already existing.

Example:

clouder kubeadm setup my-cluster
clouder kubeadm setup my-cluster --k8s-version 1.31 --key my-cluster-key

clouder kubeadm get-config

Fetch the kubeconfig from the master node and save it locally.

clouder kubeadm get-config <name> [OPTIONS]
Argument         Description
name (required)  Cluster name

Option        Short  Description
--admin-user  -u     SSH username (default: azureuser)
--key         -i     SSH key name from ~/.ssh/

The kubeconfig is saved to ~/.clouder/kubeconfigs/kubeconfig-<name> with 600 permissions. The internal API server IP is automatically replaced with the master's public IP for remote access.

Example:

clouder kubeadm get-config my-cluster
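The server-address rewrite amounts to a substitution over the fetched file. A sketch of the assumed behavior (the function name and example IPs below are illustrative, not part of clouder):

```shell
# Sketch (assumed behavior): replace the master's private API endpoint in a
# fetched kubeconfig with its public IP so kubectl can reach it remotely.
rewrite_server() {  # args: kubeconfig-file private-ip public-ip
  sed -i "s#https://$2:6443#https://$3:6443#" "$1"
}
```

For example, `rewrite_server kubeconfig-k1 10.0.0.4 203.0.113.10` would turn `server: https://10.0.0.4:6443` into `server: https://203.0.113.10:6443`.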

clouder kubeadm info

Show cluster information and next steps. Displays the current state of a kubeadm cluster and lists useful commands for day-to-day operations as well as further setup steps.

clouder kubeadm info <name>
Argument         Description
name (required)  Cluster name

Example:

clouder kubeadm info my-cluster

clouder kubeadm scale

Scale the number of worker nodes in an existing kubeadm cluster. Compares the desired worker count with the current count, then:

  • Scale up: creates new VMs, installs prerequisites, joins them to the cluster.
  • Scale down: drains and deletes the highest-numbered worker nodes.
clouder kubeadm scale <name> [OPTIONS]
Argument         Description
name (required)  Cluster name

Option                Short  Description
--workers (required)  -w     Desired number of worker nodes
--admin-user          -u     SSH username (default: azureuser)
--key                 -i     SSH key name from ~/.ssh/
--force               -f     Skip confirmation prompt

Example:

# Scale up to 5 workers
clouder kubeadm scale my-cluster --workers 5

# Scale down to 1 worker (force, no confirmation)
clouder kubeadm scale my-cluster --workers 1 --force

clouder kubeadm vm-terminate

Terminate all VMs and networking for a kubeadm cluster. Deletes VMs, NICs, public IPs, OS disks, NSG, and VNet.

clouder kubeadm vm-terminate <name> [OPTIONS]
Argument         Description
name (required)  Cluster name

Option       Short  Description
--force      -f     Skip confirmation prompt
--delete-rg         Also delete the resource group

Deletion order:

Step  Resources       Notes
1     VMs             <name>-master, <name>-node-*
2     NICs            Associated network interfaces
3     Public IPs      Associated public IP addresses
4     OS Disks        Managed disks created with VMs
5     Load Balancer   <name>-lb and <name>-lb-ip (if present)
6     NSG             <name>-nsg
7     VNet            <name>-vnet (includes subnet)
8     Resource Group  Only if --delete-rg is specified

The local kubeconfig (~/.clouder/kubeconfigs/kubeconfig-<name>) is also removed if it exists.

Example:

# Interactive confirmation
clouder kubeadm vm-terminate my-cluster

# Force delete without confirmation
clouder kubeadm vm-terminate my-cluster --force

# Also delete the resource group
clouder kubeadm vm-terminate my-cluster --force --delete-rg

clouder kubeadm smoke-test

Run an end-to-end smoke test with two phases: CRIU checkpoint/restore validation and ingress load balancer validation.

clouder kubeadm smoke-test <name> [OPTIONS]
Argument         Description
name (required)  Cluster name

Option                    Short  Description
--admin-user              -u     SSH username (default: azureuser)
--key                     -i     SSH key name from ~/.ssh/
--cleanup / --no-cleanup         Clean up test pods after the test (default: --cleanup)

Section 1 — Ingress LB (4 steps, skipped if no ingress controller):

  1. Deploy a test nginx Deployment + Service (clouder-smoke-web); show the resources with kubectl get
  2. Create an Ingress resource routing / to the test service; show it with kubectl describe
  3. Curl the Azure LB public IP from localhost and validate HTTP 200 (up to 5 retries; shows response headers and body)
  4. If DATALAYER_RUN_URL is set, curl the DNS hostname and validate HTTP 200 (shows response headers and body); skipped with a hint if the env var is not set
note

Section 1 is automatically skipped if no ingress controller (nginx or traefik) is detected. Run enable-ingress-nginx or enable-ingress-traefik first.
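Step 3's retry behavior can be approximated with a small loop like the following (a sketch of the assumed logic, not the actual implementation):

```shell
# Curl a URL up to 5 times, succeeding only on HTTP 200 (mirrors the smoke
# test's LB check, which gives the backend pool time to become healthy).
check_http_200() {  # arg: URL
  for _ in 1 2 3 4 5; do
    [ "$(curl -s -o /dev/null -w '%{http_code}' "$1")" = "200" ] && return 0
    sleep 2
  done
  return 1
}
```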

Section 2 — CRIU (8 steps):

  1. Deploy a busybox counter pod (clouder-smoke-test)
  2. Wait for the pod to be ready
  3. Let the counter accumulate state for 15 seconds
  4. Identify which worker node the pod is running on
  5. Checkpoint via the kubelet API using client certificates
  6. Verify the checkpoint archive exists on the worker node
  7. Delete the original pod
  8. Import the checkpoint into containerd (via buildah), deploy the restored pod, and validate the counter
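Step 5 targets the kubelet's checkpoint endpoint, which encodes the namespace, pod, and container in the URL path. A sketch (the node IP, pod, and container names are illustrative; the certificate paths are the kubeadm defaults):

```shell
# Build the kubelet checkpoint URL for a given container.
checkpoint_url() {  # args: node-ip namespace pod container
  echo "https://$1:10250/checkpoint/$2/$3/$4"
}
# The smoke test then POSTs to it with client certificates, e.g.:
# curl -sk --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
#      --key /etc/kubernetes/pki/apiserver-kubelet-client.key \
#      -X POST "$(checkpoint_url 10.0.0.5 default clouder-smoke-test counter)"
```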
note

Section 2 requires containerd 2.0+ (installed by setup). On older containerd versions, the checkpoint API returns "method not implemented" and this section reports a warning instead of failing.

Result interpretation:

Counter after restore  Meaning
≥ checkpoint value     CRIU restore PASSED — process continued from checkpoint
< checkpoint value     Checkpoint created, but full CRIU restore requires containerd 2.0+
Pod doesn't start      Checkpoint image format not compatible with current containerd

Example:

clouder kubeadm smoke-test my-cluster

# Keep test pods for inspection
clouder kubeadm smoke-test my-cluster --no-cleanup

clouder kubeadm enable-ingress-nginx

Deploy an ingress-nginx controller and create an Azure Load Balancer with a public IP that forwards HTTP/HTTPS traffic to the cluster.

clouder kubeadm enable-ingress-nginx <name> [OPTIONS]
Argument         Description
name (required)  Cluster name

Option        Short  Description
--admin-user  -u     SSH username (default: azureuser)
--key         -i     SSH key name from ~/.ssh/

What it does (4 steps):

  1. Deploy the ingress-nginx controller in NodePort mode on the cluster
  2. Read the assigned NodePorts for HTTP and HTTPS
  3. Create an Azure Load Balancer (<name>-lb) with a public IP and rules 80→HTTP NodePort, 443→HTTPS NodePort
  4. Add all worker VM NICs to the LB backend pool

Architecture:

Internet → Azure LB (public IP)
├── :80 → worker NICs → NodePort → ingress-nginx → Ingress rules → Services
└── :443 → worker NICs → NodePort → ingress-nginx → Ingress rules → Services

No Azure Cloud Controller Manager is needed — the LB is provisioned externally via Azure SDK.

Azure resources created:

Resource       Name          Description
Public IP      <name>-lb-ip  Standard SKU, static allocation
Load Balancer  <name>-lb     Standard SKU with HTTP/HTTPS rules

Example:

clouder kubeadm enable-ingress-nginx my-cluster

# After enabling, create Ingress resources:
clouder kubectl my-cluster create ingress demo --rule='/*=my-svc:80'

clouder kubeadm disable-ingress-nginx

Remove the Azure Load Balancer and ingress-nginx controller from the cluster.

clouder kubeadm disable-ingress-nginx <name> [OPTIONS]
Argument         Description
name (required)  Cluster name

Option        Short  Description
--admin-user  -u     SSH username (default: azureuser)
--key         -i     SSH key name from ~/.ssh/
--force       -f     Skip confirmation prompt

What it deletes:

  1. Azure Load Balancer (<name>-lb)
  2. LB public IP (<name>-lb-ip)
  3. ingress-nginx namespace and all its resources

Example:

clouder kubeadm disable-ingress-nginx my-cluster
clouder kubeadm disable-ingress-nginx my-cluster --force

clouder kubeadm enable-ingress-traefik

Deploy a Traefik ingress controller (via Helm) and create an Azure Load Balancer with a public IP that forwards HTTP/HTTPS traffic to the cluster.

clouder kubeadm enable-ingress-traefik <name> [OPTIONS]
Argument         Description
name (required)  Cluster name

Option        Short  Description
--admin-user  -u     SSH username (default: azureuser)
--key         -i     SSH key name from ~/.ssh/

What it does (4 steps):

  1. Install Helm (if needed) and deploy Traefik via Helm in NodePort mode
  2. Read the assigned NodePorts for HTTP (web) and HTTPS (websecure)
  3. Create an Azure Load Balancer (<name>-lb) with a public IP and rules 80→HTTP NodePort, 443→HTTPS NodePort
  4. Add all worker VM NICs to the LB backend pool

Architecture:

Internet → Azure LB (public IP)
├── :80 → worker NICs → NodePort → Traefik → IngressRoute/Ingress → Services
└── :443 → worker NICs → NodePort → Traefik → IngressRoute/Ingress → Services

Example:

clouder kubeadm enable-ingress-traefik my-cluster

# After enabling, create Ingress resources:
clouder kubectl my-cluster create ingress demo --rule='/*=my-svc:80'

clouder kubeadm disable-ingress-traefik

Remove the Azure Load Balancer and Traefik ingress controller from the cluster.

clouder kubeadm disable-ingress-traefik <name> [OPTIONS]
Argument         Description
name (required)  Cluster name

Option        Short  Description
--admin-user  -u     SSH username (default: azureuser)
--key         -i     SSH key name from ~/.ssh/
--force       -f     Skip confirmation prompt

What it deletes:

  1. Azure Load Balancer (<name>-lb)
  2. LB public IP (<name>-lb-ip)
  3. Traefik Helm release and traefik namespace

Example:

clouder kubeadm disable-ingress-traefik my-cluster
clouder kubeadm disable-ingress-traefik my-cluster --force

Typical Workflow

Create a Cluster

Three commands are needed to go from zero to a running Kubernetes cluster with CRIU support:

# 1. Create the VMs (1 master + 3 workers on Azure)
clouder kubeadm vm-create my-cluster

# 2. Set up Kubernetes on all nodes
clouder kubeadm setup my-cluster

# 3. Fetch the kubeconfig
clouder kubeadm get-config my-cluster

vm-create provisions 4 VMs (Standard_B4ms, 100 GB OS disks) in Azure with a VNet, subnet, and NSG. setup installs containerd 2.x, CRIU, buildah, kubeadm/kubelet/kubectl, initializes the control plane with Flannel CNI, joins workers, enables the ContainerCheckpoint feature gate on every node, and installs the Azure Disk CSI and Azure File CSI drivers for dynamic persistent volume provisioning. get-config downloads the kubeconfig to ~/.clouder/kubeconfigs/kubeconfig-my-cluster.

Use the Cluster

# Run kubectl via the persisted kubeconfig
clouder kubectl my-cluster get nodes
clouder kubectl my-cluster get pods -A
clouder kubectl my-cluster apply -f deployment.yaml

Add an Ingress Controller

Pick one of nginx or traefik. Both create an Azure Load Balancer with a public IP:

# Option A: ingress-nginx
clouder kubeadm enable-ingress-nginx my-cluster

# Option B: traefik
clouder kubeadm enable-ingress-traefik my-cluster

Validate with Smoke Test

The smoke test validates both ingress routing and CRIU checkpoint/restore end-to-end:

clouder kubeadm smoke-test my-cluster

Tear Down

# Remove the ingress controller and load balancer first
clouder kubeadm disable-ingress-nginx my-cluster

# Delete all VMs and Azure resources
clouder kubeadm vm-terminate my-cluster

clouder kubectl

Run kubectl commands using the persisted kubeconfig for a cluster.

clouder kubectl <name> <kubectl-args...>
Argument         Description
name (required)  Cluster name (kubeconfig must exist in ~/.clouder/kubeconfigs/)
kubectl-args     Any kubectl arguments (e.g. get nodes, apply -f ...)

Example:

clouder kubectl my-cluster get nodes
clouder kubectl my-cluster get pods -A
clouder kubectl my-cluster apply -f deployment.yaml
clouder kubectl my-cluster logs my-pod

This is equivalent to running:

kubectl --kubeconfig=~/.clouder/kubeconfigs/kubeconfig-my-cluster get nodes
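The wrapping is simple enough to reproduce as a shell function if you prefer calling kubectl directly (a hypothetical equivalent, not part of clouder):

```shell
# Hypothetical shell equivalent of `clouder kubectl <name> ...`: inject the
# per-cluster kubeconfig and pass every other argument through to kubectl.
clouder_kubectl() {
  name="$1"; shift
  kubectl --kubeconfig="$HOME/.clouder/kubeconfigs/kubeconfig-$name" "$@"
}
```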

clouder helm

Run helm commands using the persisted kubeconfig for a cluster.

clouder helm <name> <helm-args...>
Argument         Description
name (required)  Cluster name (kubeconfig must exist in ~/.clouder/kubeconfigs/)
helm-args        Any helm arguments (e.g. list -A, install ...)

Example:

clouder helm my-cluster list -A
clouder helm my-cluster install my-release my-chart
clouder helm my-cluster upgrade my-release my-chart
clouder helm my-cluster uninstall my-release

This is equivalent to running:

helm --kubeconfig=~/.clouder/kubeconfigs/kubeconfig-my-cluster list -A

Networking

The setup command installs Flannel as the CNI with --pod-network-cidr=10.244.0.0/16. Flannel provides VXLAN-based overlay networking that works reliably on Azure VNets where all nodes share the same subnet.

note

Calico CNI code is preserved in the codebase (commented out in kubeadm.py) for future use. Calico's default VXLANCrossSubnet mode uses direct routing when nodes are on the same subnet, which requires IP forwarding enabled on Azure NICs. Flannel's VXLAN overlay avoids this issue entirely.

CRIU Support

The setup command configures every node for CRIU checkpoint/restore:

  • containerd 2.x from Docker's official apt repository (supports CRI CheckpointContainer method)
  • SystemdCgroup enabled in containerd config
  • CRIU installed and verified (criu check)
  • buildah installed (converts checkpoint archives into OCI images for restore)
  • ContainerCheckpoint feature gate enabled on all kubelets
  • Kubelet API accessible on port 10250 (NSG rule)
caution

CRIU checkpoint requires containerd 2.0+. The Ubuntu apt containerd package provides 1.7.x which does NOT support checkpointing. Clouder's setup installs containerd.io from Docker's repo to get 2.x.

If your cluster was created with an older version of Clouder (containerd 1.7.x), you must recreate it:

clouder kubeadm vm-terminate my-cluster
clouder kubeadm vm-create my-cluster
clouder kubeadm setup my-cluster

An in-place upgrade from containerd 1.7 to 2.x is not supported.
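To confirm a node's containerd is new enough before attempting a checkpoint, a quick major-version check suffices (a sketch; the function name is illustrative):

```shell
# True when the given containerd version supports the CRI CheckpointContainer
# call (2.0 and later); the Ubuntu-packaged 1.7.x fails this check.
supports_checkpoint() {  # arg: version string such as "2.0.1"
  [ "${1%%.*}" -ge 2 ]
}
# On a node: supports_checkpoint "$(containerd --version | awk '{print $3}' | tr -d v)"
```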

Storage

The setup command installs the Azure Disk CSI driver and the Azure File CSI driver for dynamic persistent volume provisioning. The Disk CSI driver lets pods request PersistentVolumeClaim resources backed by Azure Managed Disks. The File CSI driver enables NFS-based shared filesystem provisioning; setup creates the azure-nfs StorageClass with the cluster's Azure subscription, resource group, and location baked in (required on kubeadm; on AKS these are auto-detected from IMDS).

What gets configured:

Component                      Description
/etc/kubernetes/azure.json     Azure cloud-provider config deployed to all nodes
azure-cloud-provider secret    Kubernetes secret in kube-system for the CSI controller
Azure Disk CSI driver v1.30.3  Controller + node DaemonSet with snapshot support
Azure File CSI driver v1.30.6  Controller + node DaemonSet for NFS file shares
managed-csi StorageClass       Default StorageClass using StandardSSD_LRS, WaitForFirstConsumer binding, volume expansion enabled
azure-nfs StorageClass         NFS StorageClass for shared filesystems (file.csi.azure.com, Premium_LRS) with Azure params baked in
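With managed-csi as the default class, a claim like the following (the name and size are illustrative) provisions an Azure Managed Disk once a pod first consumes it, thanks to WaitForFirstConsumer binding:

```yaml
# Example PVC (hypothetical name) using the default managed-csi StorageClass.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-csi
  resources:
    requests:
      storage: 10Gi
```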

Service principal:

The CSI driver requires an Azure service principal with Contributor access to the cluster's resource group. Clouder will:

  1. Use existing AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET environment variables if set.
  2. Otherwise, auto-create a new SP (clouder-<name>-csi) scoped to the resource group via az ad sp create-for-rbac.
  3. Skip the step with a warning if credentials cannot be obtained.
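Step 1 of this fallback amounts to checking that all three variables are non-empty, roughly (a sketch of the assumed logic; the function name is illustrative):

```shell
# True when a usable service principal is already present in the environment.
have_sp_env() {
  [ -n "$AZURE_TENANT_ID" ] && [ -n "$AZURE_CLIENT_ID" ] && [ -n "$AZURE_CLIENT_SECRET" ]
}
# When this fails, clouder falls back to `az ad sp create-for-rbac`.
```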

Manual installation (if Step 6 was skipped):

# 1. Create a service principal
az ad sp create-for-rbac --name clouder-<name>-csi \
  --role Contributor \
  --scopes /subscriptions/<sub-id>/resourceGroups/<rg>

# 2. Create /etc/kubernetes/azure.json on every node (see Azure Disk CSI docs)
# 3. Install the CSI drivers
curl -skSL https://raw.githubusercontent.com/kubernetes-sigs/azuredisk-csi-driver/v1.30.3/deploy/install-driver.sh | bash -s v1.30.3 snapshot --
curl -skSL https://raw.githubusercontent.com/kubernetes-sigs/azurefile-csi-driver/v1.30.6/deploy/install-driver.sh | bash -s v1.30.6 --

# 4. Create the StorageClasses
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azure-nfs
provisioner: file.csi.azure.com
allowVolumeExpansion: true
parameters:
  shareName: shared-storage
  skuName: Premium_LRS
  protocol: nfs
  subscriptionID: "<sub-id>"
  resourceGroup: "<rg>"
  location: "<region>"
EOF

Use smoke-test to validate the full checkpoint/restore pipeline and ingress load balancer.

See the CRIU documentation for checkpoint and restore workflows.

tip

All VMs are on the same subnet (10.0.0.0/24), so kubeadm nodes discover each other using private IPs. The setup command handles this automatically.

note

Currently only supported for Azure. OVH support is planned.