clouder kubeadm

Provision, set up, and manage Kubeadm Kubernetes clusters. Handles the full lifecycle: VM creation, software installation, cluster initialization, CRIU configuration, testing, kubeconfig retrieval, and cluster teardown.

Alias: clouder k (for example, clouder k ls, clouder k create my-cluster).

Overview

# Create the VMs (1 master + 1 worker)
clouder kubeadm create --workers 1 my-cluster

# List locally known kubeadm clusters
clouder kubeadm ls

# Include node counts/readiness details (slower)
clouder kubeadm ls --details

# Set up Kubernetes (containerd 2.x, CRIU, buildah, Kubeadm, Flannel CNI, feature gates on all nodes)
clouder kubeadm setup my-cluster

# Fetch the kubeconfig to ~/.clouder/kubeadm/my-cluster/kubeconfig
clouder kubeadm get-config my-cluster

# Select cluster kubeconfig for current shell usage (fetches if missing)
clouder kubeadm use my-cluster

# Verify
clouder kubectl my-cluster get nodes
clouder kubectl my-cluster get pods -A

# Enable ingress with Traefik controller and cloud-specific load balancer setup
clouder kubeadm enable-ingress-traefik my-cluster
clouder helm my-cluster ls -A

# Run the smoke test (Ingress + CRIU validation)
clouder kubeadm smoke-test my-cluster

# Upgrade kubelet/kubeadm/kubectl on all nodes (e.g. after bumping K8S_VERSION)
clouder kubeadm upgrade-kubelet my-cluster

# Prune unhealthy worker Kubernetes nodes/VMs (interactive confirmation)
clouder kubeadm prune my-cluster

# Repair worker VMs that exist in cloud inventory but are missing from kubectl nodes
clouder kubeadm repair my-cluster

# Disable ingress
clouder kubeadm disable-ingress-traefik my-cluster

# Tear down the cluster and delete all cloud resources
clouder kubeadm terminate my-cluster

Commands

By default, commands below accept <name>. If a default is configured with clouder kubeadm set-default <name>, you can omit <name> and Clouder will use the persisted default cluster.

Most operational commands also accept:

  • --cloud azure|aws to force the target provider when your current context is different.
  • If omitted, Clouder resolves cloud from cluster metadata first, then current context.
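The resolution order above can be sketched as a small shell function. This only illustrates the precedence (explicit flag, then cluster metadata, then current context) and is not Clouder's implementation; `resolve_cloud` and its three positional arguments are hypothetical names.

```shell
# Hypothetical helper illustrating the precedence described above.
# $1 = value passed to --cloud (may be empty)
# $2 = cloud recorded in the cluster metadata (may be empty)
# $3 = cloud of the current context
resolve_cloud() {
  if [ -n "$1" ]; then
    echo "$1"
  elif [ -n "$2" ]; then
    echo "$2"
  else
    echo "$3"
  fi
}

# No --cloud flag given, metadata says aws, current context is azure.
resolve_cloud "" aws azure
```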

clouder kubeadm ls

List kubeadm clusters known locally from ~/.clouder/kubeadm/<name>/kubeadm.json.

clouder kubeadm ls

Optional:

  • --cloud azure|aws to filter clusters by provider.

The command prints a table with cluster name, cloud, region, resource group, whether kubeconfig exists, and whether setup is marked complete.

Use --details to include master/worker counts and readiness state:

clouder kubeadm ls --details
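As a rough illustration of what the plain listing reads, the sketch below walks the ~/.clouder/kubeadm/<name>/kubeadm.json files and reports whether a kubeconfig sits next to each one. `list_clusters` is a hypothetical helper written for this page; the real command prints more columns.

```shell
# Hypothetical helper: list locally known clusters and kubeconfig presence.
# $1 = root directory (defaults to ~/.clouder/kubeadm)
list_clusters() (
  root="${1:-$HOME/.clouder/kubeadm}"
  for meta in "$root"/*/kubeadm.json; do
    # Skip the unexpanded glob when no cluster exists.
    [ -f "$meta" ] || continue
    dir=$(dirname "$meta")
    name=$(basename "$dir")
    if [ -f "$dir/kubeconfig" ]; then kc=yes; else kc=no; fi
    printf '%s\tkubeconfig=%s\n' "$name" "$kc"
  done
)

list_clusters
```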

clouder kubeadm set-default

Persist the default kubeadm cluster name in ~/.clouder/clouder.yaml.

clouder kubeadm set-default <name>

Optional:

  • --cloud azure|aws to validate the selected cluster matches the expected provider.

Examples:

clouder kubeadm set-default my-cluster

# Now cluster name can be omitted on kubeadm commands that operate on existing clusters
clouder kubeadm info
clouder kubeadm get-config
clouder kubeadm scale --workers 5

clouder kubeadm create

Create VMs for a Kubeadm cluster: 1 master + N worker nodes on the same subnet.

clouder kubeadm create <name> [OPTIONS]

| Argument | Description |
| --- | --- |
| name (required) | Cluster name (used as prefix for all VMs) |

| Option | Short | Description |
| --- | --- | --- |
| --cloud | | Target cloud provider (azure or aws) |
| --workers | -w | Number of worker nodes (default: 3) |
| --region | -r | Cloud region (e.g. eastus, us-east-1) |
| --resource-group | -g | Resource group (Azure only) |
| --master-size | | VM size for the master node (default: Standard_B4ms) |
| --node-size | | VM size for worker nodes (default: Standard_B4ms) |
| --os-disk-size | | OS disk size in GB (default: 100, min 30) |
| --admin-user | | Admin username (default: azureuser on Azure, ubuntu on AWS) |
| --image | | Image: Ubuntu2204, Ubuntu2404, Debian12 (Azure only, default: Ubuntu2204) |

What gets created:

| Resource | Name | Description |
| --- | --- | --- |
| VNet | <name>-vnet | Virtual network with 10.0.0.0/16 address space |
| Subnet | <name>-subnet | Subnet with 10.0.0.0/24 range |
| NSG | <name>-nsg | Network Security Group with rules (see below) |
| Master VM | <name>-master | Control plane node |
| Worker VMs | <name>-node-1, <name>-node-2, ... | Worker nodes |

NSG rules:

| Rule | Port | Source | Purpose |
| --- | --- | --- | --- |
| AllowSSH | 22 | Any | SSH access |
| AllowK8sAPI | 6443 | Any | Kubernetes API server |
| AllowKubelet | 10250 | VNet (10.0.0.0/16) | Kubelet communication |
| AllowNodePorts | 30000-32767 | Any | Kubernetes NodePort services |
| AllowHTTP | 80 | Any | HTTP ingress traffic |
| AllowHTTPS | 443 | Any | HTTPS ingress traffic |

All VMs share the same VNet, subnet, and NSG — they can communicate with each other on private IPs while being individually accessible via public IPs.

After create, Clouder persists cluster metadata to ~/.clouder/kubeadm/<name>/kubeadm.json for later commands (setup, scale, info, cloud helper commands). The file includes, at minimum:

  • cloud (azure or aws)
  • region
  • cloud context identifiers (subscription_id for Azure, account_id for AWS)
  • requested node shape/count (requested_workers, master_size, node_size)
  • image/AMI details and admin user
  • master/worker node names + IPs
  • network resource identifiers

Interactive prompts (same as clouder vm create):

  1. Resource group — pick existing or create new
  2. Region — only if creating a new resource group
  3. SSH key — pick existing or generate new

Recent cloud-specific create safeguards:

  • AWS now validates the selected context account against the active AWS credentials before provisioning.
  • AWS create uses deterministic region scoping (defaults to us-east-1 when --region is omitted).
  • AWS validates EC2 key pair selection in-region before provisioning starts.
  • AWS create rolls back VPC/subnet/route table/IGW/security group resources if VM creation fails mid-flight.
  • --os-disk-size is validated with a minimum of 30 GB before provisioning starts.

clouder kubeadm setup

Set up a Kubeadm cluster on previously created VMs. This is the main automation command — it SSHes into each VM and runs all setup steps.

clouder kubeadm setup <name> [OPTIONS]

| Argument | Description |
| --- | --- |
| name (required) | Cluster name (must match the create name) |

| Option | Short | Description |
| --- | --- | --- |
| --cloud | | Target cloud provider (azure or aws) |
| --admin-user | -u | SSH username (default: ubuntu) |
| --key | -i | SSH key name from ~/.ssh/ |
| --k8s-version | | Kubernetes version (default: 1.32) |
| --node-label | | Node label (key=value) applied after each worker becomes Ready. Repeatable or comma-separated. Defaults to runtime labels. |

Default worker labels applied after each node becomes Ready:

  • role.datalayer.io/runtime=true
  • node.datalayer.io/variant=medium
  • xpu.datalayer.io/cpu=true

What it does (6 steps):

| Step | Target | Description |
| --- | --- | --- |
| 1. Prerequisites | All nodes | Disable swap, load kernel modules (overlay, br_netfilter), configure sysctl, install containerd 2.x (from Docker repo, with SystemdCgroup), install CRIU, install buildah, install kubeadm/kubelet/kubectl |
| 2. kubeadm init | Master | Initialize the control plane with --pod-network-cidr=10.244.0.0/16, enable ContainerCheckpoint feature gate |
| 3. Install CNI | Master | Install Flannel CNI for pod networking |
| 4. kubeadm join | Workers | Join all worker nodes to the cluster |
| 5. CRIU feature gates | All nodes | Enable ContainerCheckpoint feature gate on all kubelets (master + workers) |
| 6. Cloud storage and load balancer bootstrap | Cloud-specific | Azure: deploy cloud config, install Azure Disk CSI (managed-csi) and Azure File CSI (azure-nfs). AWS: install AWS EBS CSI with default gp3 StorageClass and install AWS Load Balancer Controller (instance profile preferred, static credentials fallback). |
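
To make step 1 more concrete, here is a minimal sketch of the kernel module and sysctl files it refers to, following the upstream Kubernetes container runtime prerequisites. This page does not state the exact sysctl keys or file paths Clouder writes, so those are assumptions, and the helper names are hypothetical.

```shell
# Kernel modules to load at boot (assumed target: /etc/modules-load.d/k8s.conf).
k8s_modules_conf() {
  printf '%s\n' overlay br_netfilter
}

# sysctl settings for bridged pod traffic and forwarding
# (assumed target: /etc/sysctl.d/k8s.conf; keys taken from upstream Kubernetes docs).
k8s_sysctl_conf() {
  printf '%s\n' \
    'net.bridge.bridge-nf-call-iptables = 1' \
    'net.bridge.bridge-nf-call-ip6tables = 1' \
    'net.ipv4.ip_forward = 1'
}

# On a node, as root, these would be applied roughly as follows:
#   swapoff -a
#   k8s_modules_conf > /etc/modules-load.d/k8s.conf
#   modprobe overlay && modprobe br_netfilter
#   k8s_sysctl_conf > /etc/sysctl.d/k8s.conf
#   sysctl --system
k8s_modules_conf
k8s_sysctl_conf
```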
note

Azure storage setup requires a service principal. If AZURE_TENANT_ID, AZURE_CLIENT_ID, and AZURE_CLIENT_SECRET are set, they are used. Otherwise, Clouder attempts to create a scoped SP automatically.

AWS storage setup prefers EC2 instance profiles on node VMs. If no instance profile is attached, Clouder falls back to the active AWS credentials in the current session.

On AWS, load balancer setup installs AWS Load Balancer Controller and expects node IAM permissions for ELB operations.

AWS cloud integration details:

  • Storage: installs AWS EBS CSI and configures a default gp3 StorageClass.
  • Load balancer: installs AWS Load Balancer Controller with cluster-aware settings.
  • Authentication preference: EC2 instance profile first, then fallback to active AWS credentials when needed.

Example:

clouder kubeadm setup my-cluster
clouder kubeadm setup my-cluster --k8s-version 1.31 --key my-cluster-key

clouder kubeadm get-config

Fetch the kubeconfig from the master node and save it locally.

clouder kubeadm get-config <name> [OPTIONS]

| Argument | Description |
| --- | --- |
| name (required) | Cluster name |

| Option | Short | Description |
| --- | --- | --- |
| --cloud | | Target cloud provider (azure or aws) |
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |

The kubeconfig is saved to ~/.clouder/kubeadm/<name>/kubeconfig with 600 permissions. The internal API server IP is automatically replaced with the master's public IP for remote access.

Additionally, the kubelet client certificates (apiserver-kubelet-client.crt and .key) are fetched from the master and saved to ~/.clouder/kubeadm/<name>/. These are needed for CRIU checkpoint API calls via the kubelet.
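
The API server address rewrite described above can be approximated with sed. This is a sketch under the assumption that the kubeconfig contains a single `server: https://<ip>:<port>` line; `rewrite_api_server` is a hypothetical helper and the sample addresses are illustrative.

```shell
# Hypothetical helper: replace the host in the kubeconfig "server:" line.
# $1 = master public IP; the kubeconfig is read from stdin and written to stdout.
# The port (if any) is left untouched because only the host part is matched.
rewrite_api_server() {
  sed -E "s#^([[:space:]]*server:[[:space:]]*)https://[^:/]+#\1https://$1#"
}

# Sample line as it might appear in a kubeconfig fetched from the master.
printf '    server: https://10.0.0.4:6443\n' | rewrite_api_server 203.0.113.10
# prints "    server: https://203.0.113.10:6443"
```

After writing the result to disk, `chmod 600` on the file matches the permissions stated above.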

Example:

clouder kubeadm get-config my-cluster
# Output:
# Kubeconfig saved to ~/.clouder/kubeadm/my-cluster/kubeconfig
# Kubelet client cert saved to ~/.clouder/kubeadm/my-cluster/apiserver-kubelet-client.crt
# Kubelet client key saved to ~/.clouder/kubeadm/my-cluster/apiserver-kubelet-client.key
#
# Usage:
# export KUBECONFIG=~/.clouder/kubeadm/my-cluster/kubeconfig
# export KUBELET_CLIENT_CERT=~/.clouder/kubeadm/my-cluster/apiserver-kubelet-client.crt
# export KUBELET_CLIENT_KEY=~/.clouder/kubeadm/my-cluster/apiserver-kubelet-client.key
# kubectl get nodes

clouder kubeadm terminate

Terminate all cluster resources for a kubeadm deployment.

clouder kubeadm terminate <name>

Cloud-specific behavior:

  • Azure: deletes VM-attached resources, load balancer artifacts, NSG/VNet, and optionally the resource group.
  • AWS: uses the cluster metadata region to discover and terminate EC2 instances, then tears down managed VPC networking resources.

This ensures cleanup is scoped to the correct cloud primitives and avoids cross-region drift on AWS.


clouder kubeadm use

Select a kubeadm cluster kubeconfig for local kubectl usage.

clouder kubeadm use [name] [OPTIONS]

| Argument | Description |
| --- | --- |
| name (optional) | Cluster name. If omitted, Clouder uses the default kubeadm cluster (from set-default). |

| Option | Description |
| --- | --- |
| --cloud | Target cloud provider (azure or aws) |
| --admin-user, -u | SSH username used only when kubeconfig must be fetched from master (default: azureuser) |
| --key, -i | SSH key name from ~/.ssh/ used only when kubeconfig must be fetched |
| --print-export | Print only export KUBECONFIG=... (for eval) |

Behavior:

  • If ~/.clouder/kubeadm/<name>/kubeconfig exists, it is used directly.
  • If it does not exist, Clouder fetches it from the master (same behavior as get-config), then uses it.
  • The command prints an export line to apply in your shell.

Examples:

# Use explicit cluster
clouder kubeadm use my-cluster

# Use default cluster (set with set-default)
clouder kubeadm use

# Shell-friendly one-liner
eval "$(clouder kubeadm use my-cluster --print-export)"

clouder kubeadm info

Show cluster information and next steps. Displays the current state of a Kubeadm cluster and lists useful commands for day-to-day operations as well as further setup steps.

clouder kubeadm info [name]

Optional flag: --cloud azure|aws.

| Argument | Description |
| --- | --- |
| name (optional) | Cluster name. If omitted, Clouder uses the default kubeadm cluster (from set-default). |

Example:

clouder kubeadm info my-cluster

clouder kubeadm scale

Scale the number of worker nodes in an existing Kubeadm cluster. Compares the desired worker count with the current count, then:

  • Scale up: creates new VMs, installs prerequisites, upgrades kubelet/kubeadm/kubectl on each new worker, then joins workers to the cluster.
  • Scale down: removes nodes one by one using a least-loaded-first workflow with explicit completion waits.

Scale-down workflow (per node):

  1. Identify the worker node with the fewest running pods (least-loaded priority).
  2. Mark that node unschedulable (kubectl cordon) and wait until the node is cordoned.
  3. Delete all evictable pods on that node (non-DaemonSet) and wait until they are fully terminated.
  4. Remove the Kubernetes node object, then delete the Azure VM and wait until deletion is complete.

DaemonSet-managed pods are expected to remain on a cordoned node during this phase and are ignored by the completion wait.

This sequence is repeated iteratively until the desired worker count is reached, with clear logs printed for every step.
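
The per-node workflow can be sketched with standard kubectl commands. This is an approximation, not Clouder's code: `kubectl drain` stands in for the cordon and pod deletion steps, the helper names are hypothetical, the pod counts are sample data, and the VM deletion step is not shown.

```shell
# Pick the node with the fewest running pods.
# Reads "node-name pod-count" lines on stdin.
least_loaded_node() {
  sort -k2,2n | head -n 1 | awk '{print $1}'
}

# Print the kubectl commands for removing one worker, without running them.
# --ignore-daemonsets mirrors the note that DaemonSet pods are left in place.
scale_down_plan() {
  echo "kubectl cordon $1"
  echo "kubectl drain $1 --ignore-daemonsets --delete-emptydir-data"
  echo "kubectl delete node $1"
}

# Sample data: three workers with their running pod counts.
victim=$(printf '%s\n' \
  'my-cluster-node-1 12' \
  'my-cluster-node-2 3' \
  'my-cluster-node-3 7' | least_loaded_node)
scale_down_plan "$victim"
```

Printing the plan instead of executing it keeps the sketch safe to run anywhere; piping the output to `sh` on a machine with cluster access would execute it.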

clouder kubeadm scale <name> [OPTIONS]

| Argument | Description |
| --- | --- |
| name (required) | Cluster name |

| Option | Short | Description |
| --- | --- | --- |
| --cloud | | Target cloud provider (azure or aws) |
| --workers (required) | -w | Desired number of worker nodes |
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
| --os-disk-size-gb | | OS disk size in GiB for newly created workers. Larger disks increase node ephemeral-storage capacity. |
| --node-label | | Node label (key=value) applied after each new worker becomes Ready. Repeatable or comma-separated. Defaults to runtime labels. |
| --force | -f | Skip confirmation prompt |

Example:

# Scale up to 5 workers
clouder kubeadm scale my-cluster --workers 5

# Scale up with larger worker OS disks (improves ephemeral-storage capacity)
clouder kubeadm scale my-cluster --workers 5 --os-disk-size-gb 128

# Scale down to 1 worker (force, no confirmation)
clouder kubeadm scale my-cluster --workers 1 --force
note

Worker scaling automation currently supports Azure. Use --cloud azure when needed.


clouder kubeadm prune

Identify unhealthy worker Kubernetes nodes and cloud VMs, list them, and ask for confirmation before force deletion.

clouder kubeadm prune <name> [OPTIONS]

| Argument | Description |
| --- | --- |
| name (required) | Cluster name |

| Option | Short | Description |
| --- | --- | --- |
| --cloud | | Target cloud provider (azure or aws) |
| --force | -f | Skip confirmation prompt |

What is considered unhealthy:

  • Kubernetes worker node where Ready != True
  • Azure worker VM where provisioning_state != Succeeded
  • AWS worker VM where EC2 state != running

When confirmed, Clouder force-deletes matching Kubernetes node objects and Azure VMs.
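
The Kubernetes side of the unhealthy check can be approximated by filtering `kubectl get nodes --no-headers` output. This is a sketch, not Clouder's detection logic: `unhealthy_workers` is a hypothetical helper, and the node listing below is sample data.

```shell
# Hypothetical helper: print worker nodes whose STATUS does not start with "Ready".
# Expects "kubectl get nodes --no-headers" style lines on stdin:
#   NAME STATUS ROLES AGE VERSION
# Control-plane nodes are skipped because prune targets workers only.
unhealthy_workers() {
  awk '$3 !~ /control-plane/ && $2 !~ /^Ready(,|$)/ {print $1}'
}

# Sample listing; on a real cluster, pipe in:
#   clouder kubectl my-cluster get nodes --no-headers
unhealthy_workers <<'EOF'
my-cluster-master Ready control-plane 10d v1.32.0
my-cluster-node-1 Ready <none> 10d v1.32.0
my-cluster-node-2 NotReady <none> 10d v1.32.0
my-cluster-node-3 Ready,SchedulingDisabled <none> 10d v1.32.0
EOF
```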

Example:

# Interactive confirmation
clouder kubeadm prune my-cluster

# No prompt
clouder kubeadm prune my-cluster --force

clouder kubeadm repair

Detect worker VMs that exist in the cluster VM inventory but are not registered in Kubernetes nodes, then reconcile each missing worker with a full node setup and join flow.

clouder kubeadm repair <name> [OPTIONS]

| Argument | Description |
| --- | --- |
| name (required) | Cluster name |

| Option | Short | Description |
| --- | --- | --- |
| --admin-user | -u | SSH username (default: metadata value, or azureuser/ubuntu by cloud) |
| --key | -i | SSH key name from ~/.ssh/ |
| --node-label | | Node label (key=value) applied after each reconciled worker becomes Ready. Repeatable or comma-separated. Defaults to runtime labels. |

What it does:

  1. Lists VM inventory for the cluster (master + workers)
  2. Lists Kubernetes nodes via kubectl get nodes on the master
  3. Finds worker VMs missing from Kubernetes registration
  4. For each missing worker, runs full reconciliation:
  • kubelet/kubeadm/kubectl upgrade
  • node prerequisites setup
  • kubeadm reset + fresh join
  • feature-gate setup
  • wait for Ready
  • apply node labels
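
Steps 1 to 3 amount to a set difference between the VM inventory and the registered nodes. The sketch below shows that comparison on two plain-text name lists; it assumes Kubernetes node names equal VM names and that workers follow the <name>-node-N pattern from the create section. `missing_workers` is a hypothetical helper.

```shell
# Hypothetical helper: print worker VMs that are not registered as Kubernetes nodes.
# $1 = file with one VM name per line (cloud inventory)
# $2 = file with one node name per line (kubectl get nodes -o name, prefix stripped)
missing_workers() {
  # Keep only worker VMs, then drop every name that appears verbatim in $2.
  grep -- '-node-[0-9][0-9]*$' "$1" | grep -vxF -f "$2"
}

# Sample data in temporary files.
inv=$(mktemp)
nodes=$(mktemp)
printf '%s\n' my-cluster-master my-cluster-node-1 my-cluster-node-2 my-cluster-node-3 > "$inv"
printf '%s\n' my-cluster-master my-cluster-node-1 my-cluster-node-3 > "$nodes"
missing_workers "$inv" "$nodes"
rm -f "$inv" "$nodes"
```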

Examples:

# Repair with default runtime labels
clouder kubeadm repair my-cluster

# Repair and apply custom labels
clouder kubeadm repair my-cluster --node-label role.datalayer.io/runtime=true --node-label node.datalayer.io/variant=large

clouder kubeadm terminate

Terminate all VMs and networking for a Kubeadm cluster. Deletes VMs, NICs, public IPs, OS disks, NSG, and VNet.

clouder kubeadm terminate <name> [OPTIONS]

| Argument | Description |
| --- | --- |
| name (required) | Cluster name |

| Option | Short | Description |
| --- | --- | --- |
| --force | -f | Skip confirmation prompt |
| --delete-rg | | Also delete the resource group |

Deletion order:

| Step | Resources | Notes |
| --- | --- | --- |
| 1 | VMs | <name>-master, <name>-node-* |
| 2 | NICs | Associated network interfaces |
| 3 | Public IPs | Associated public IP addresses |
| 4 | OS Disks | Managed disks created with VMs |
| 5 | Load Balancer | <name>-lb and <name>-lb-ip (if they exist) |
| 6 | NSG | <name>-nsg |
| 7 | VNet | <name>-vnet (includes subnet) |
| 8 | Resource Group | Only if --delete-rg is specified |

The local kubeconfig (~/.clouder/kubeadm/<name>/kubeconfig) is also removed if it exists.

Example:

# Interactive confirmation
clouder kubeadm terminate my-cluster

# Force delete without confirmation
clouder kubeadm terminate my-cluster --force

# Also delete the resource group
clouder kubeadm terminate my-cluster --force --delete-rg

clouder kubeadm smoke-test

Run an end-to-end smoke test with two sections: ingress load balancer validation, then CRIU checkpoint/restore validation.

clouder kubeadm smoke-test <name> [OPTIONS]

| Argument | Description |
| --- | --- |
| name (required) | Cluster name |

| Option | Short | Description |
| --- | --- | --- |
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
| --cleanup / --no-cleanup | | Clean up test pods after the test (default: --cleanup) |

Section 1 — Ingress LB (4 steps, skipped if no ingress controller):

| Step | Description |
| --- | --- |
| 1 | Deploy test nginx Deployment + Service (clouder-smoke-web), show resources with kubectl get |
| 2 | Create an Ingress resource routing / to the test service, show with kubectl describe |
| 3 | Curl the Azure LB public IP from localhost and validate HTTP 200 (up to 5 retries, shows response headers and body) |
| 4 | If DATALAYER_RUN_URL is set, curl the DNS hostname and validate HTTP 200 (shows response headers and body). Skipped with a hint if the env var is not set. |
note

Section 1 is automatically skipped if no ingress controller (nginx or traefik) is detected. Run enable-ingress-nginx or enable-ingress-traefik first.

Section 2 — CRIU (8 steps):

| Step | Description |
| --- | --- |
| 1 | Deploy a busybox counter pod (clouder-smoke-test) |
| 2 | Wait for pod to be ready |
| 3 | Let the counter accumulate state for 15 seconds |
| 4 | Identify which worker node the pod is running on |
| 5 | Checkpoint via kubelet API using client certificates |
| 6 | Verify checkpoint archive exists on the worker node |
| 7 | Delete the original pod |
| 8 | Import checkpoint into containerd (via buildah), deploy restored pod, validate counter |
note

Section 2 requires containerd 2.0+ (installed by setup). On older containerd versions, the checkpoint API returns "method not implemented" and this section reports a warning instead of failing.
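
Step 5 relies on the kubelet checkpoint endpoint, `POST /checkpoint/<namespace>/<pod>/<container>` on port 10250. The sketch below builds that URL and shows how a curl call with the client certificates fetched by get-config might look. The namespace, container name, and node IP are assumptions, the helper names are hypothetical, and since the NSG only opens port 10250 inside the VNet, the call would have to run from a machine in that network (for example the master).

```shell
# Build the kubelet checkpoint URL.
# $1 = node IP, $2 = namespace, $3 = pod, $4 = container
checkpoint_url() {
  printf 'https://%s:10250/checkpoint/%s/%s/%s\n' "$1" "$2" "$3" "$4"
}

# Issue the checkpoint request (defined here, not called by this sketch).
# KUBELET_CLIENT_CERT and KUBELET_CLIENT_KEY are the variables shown in the
# get-config usage output. --insecure is an assumption: it skips verification
# of the kubelet serving certificate, which is often self-signed.
checkpoint_pod() {
  curl -sS -X POST --insecure \
    --cert "$KUBELET_CLIENT_CERT" --key "$KUBELET_CLIENT_KEY" \
    "$(checkpoint_url "$@")"
}

# Show the URL for the smoke-test pod on a worker with private IP 10.0.0.5.
checkpoint_url 10.0.0.5 default clouder-smoke-test counter
```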

Result interpretation:

| Counter after restore | Meaning |
| --- | --- |
| ≥ checkpoint value | CRIU restore: PASSED (process continued from checkpoint) |
| < checkpoint value | Checkpoint created, but full CRIU restore requires containerd 2.0+ |
| Pod doesn't start | Checkpoint image format not compatible with current containerd |
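
The table above can be read as a three-way comparison. The function below is only an illustration of that reading; `interpret_restore` is a hypothetical name and the messages are paraphrased from the table, not actual smoke-test output.

```shell
# Hypothetical helper mirroring the result table.
# $1 = counter value at checkpoint time
# $2 = counter value read from the restored pod (empty if the pod never started)
interpret_restore() {
  if [ -z "$2" ]; then
    echo "pod did not start: checkpoint image format not compatible with current containerd"
  elif [ "$2" -ge "$1" ]; then
    echo "PASSED: process continued from checkpoint"
  else
    echo "checkpoint created, but full CRIU restore requires containerd 2.0+"
  fi
}

interpret_restore 15 22
interpret_restore 15 3
interpret_restore 15 ""
```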

Example:

clouder kubeadm smoke-test my-cluster

# Keep test pods for inspection
clouder kubeadm smoke-test my-cluster --no-cleanup

clouder kubeadm enable-ingress-nginx

Deploy an ingress-nginx controller and create an Azure Load Balancer with a public IP that forwards HTTP/HTTPS traffic to the cluster.

clouder kubeadm enable-ingress-nginx <name> [OPTIONS]

| Argument | Description |
| --- | --- |
| name (required) | Cluster name |

| Option | Short | Description |
| --- | --- | --- |
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |

What it does (6 steps):

| Step | Description |
| --- | --- |
| 1 | Deploy ingress-nginx controller in NodePort mode on the cluster |
| 2 | Read the assigned NodePorts for HTTP and HTTPS |
| 3 | Create Azure Load Balancer (<name>-lb) and wait for the public IP assignment |
| 4 | Add all worker VM NICs to the LB backend pool |
| 5 | Ask for the DNS hostname mapped to the LB IP and test hostname resolution |
| 6 | Persist hostname in kubeadm.json only after DNS resolves to the LB IP |

Architecture:

Internet → Azure LB (public IP)
├── :80 → worker NICs → NodePort → ingress-nginx → Ingress rules → Services
└── :443 → worker NICs → NodePort → ingress-nginx → Ingress rules → Services

No Azure Cloud Controller Manager is needed — the LB is provisioned externally via Azure SDK.

Azure resources created:

| Resource | Name | Description |
| --- | --- | --- |
| Public IP | <name>-lb-ip | Standard SKU, static allocation |
| Load Balancer | <name>-lb | Standard SKU with HTTP/HTTPS rules |

Example:

clouder kubeadm enable-ingress-nginx my-cluster

# After enabling, create Ingress resources:
clouder kubectl my-cluster create ingress demo --rule='/*=my-svc:80'

clouder kubeadm disable-ingress-nginx

Remove the Azure Load Balancer and ingress-nginx controller from the cluster.

clouder kubeadm disable-ingress-nginx <name> [OPTIONS]

| Argument | Description |
| --- | --- |
| name (required) | Cluster name |

| Option | Short | Description |
| --- | --- | --- |
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
| --force | -f | Skip confirmation prompt |

What it deletes:

  1. Azure Load Balancer (<name>-lb)
  2. LB public IP (<name>-lb-ip)
  3. ingress-nginx namespace and all its resources

Example:

clouder kubeadm disable-ingress-nginx my-cluster
clouder kubeadm disable-ingress-nginx my-cluster --force

clouder kubeadm enable-ingress-traefik

Deploy a Traefik ingress controller (via Helm) and create an Azure Load Balancer with a public IP that forwards HTTP/HTTPS traffic to the cluster.

clouder kubeadm enable-ingress-traefik <name> [OPTIONS]

| Argument | Description |
| --- | --- |
| name (required) | Cluster name |

| Option | Short | Description |
| --- | --- | --- |
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |

What it does (6 steps):

| Step | Description |
| --- | --- |
| 1 | Install Helm (if needed), deploy Traefik via Helm in NodePort mode |
| 2 | Read the assigned NodePorts for HTTP (web) and HTTPS (websecure) |
| 3 | Create Azure Load Balancer (<name>-lb) and wait for the public IP assignment |
| 4 | Add all worker VM NICs to the LB backend pool |
| 5 | Ask for the DNS hostname mapped to the LB IP and test hostname resolution |
| 6 | Persist hostname in kubeadm.json only after DNS resolves to the LB IP |

Architecture:

Internet → Azure LB (public IP)
├── :80 → worker NICs → NodePort → Traefik → IngressRoute/Ingress → Services
└── :443 → worker NICs → NodePort → Traefik → IngressRoute/Ingress → Services

Example:

clouder kubeadm enable-ingress-traefik my-cluster

# After enabling, create Ingress resources:
clouder kubectl my-cluster create ingress demo --rule='/*=my-svc:80'

clouder kubeadm disable-ingress-traefik

Remove the Azure Load Balancer and Traefik ingress controller from the cluster.

clouder kubeadm disable-ingress-traefik <name> [OPTIONS]

| Argument | Description |
| --- | --- |
| name (required) | Cluster name |

| Option | Short | Description |
| --- | --- | --- |
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |
| --force | -f | Skip confirmation prompt |

What it deletes:

  1. Azure Load Balancer (<name>-lb)
  2. LB public IP (<name>-lb-ip)
  3. Traefik Helm release and traefik namespace

Example:

clouder kubeadm disable-ingress-traefik my-cluster
clouder kubeadm disable-ingress-traefik my-cluster --force

clouder kubeadm upgrade-kubelet

Upgrade kubelet, kubeadm, and kubectl on all nodes of an existing cluster to the target Kubernetes version (K8S_VERSION, currently 1.32). This is needed when the cluster was originally provisioned with an older Kubernetes version and requires features available in newer kubelet releases (e.g. the checkpoint timeout query parameter, available in Kubernetes 1.30 and later).

clouder kubeadm upgrade-kubelet <name> [OPTIONS]

| Argument | Description |
| --- | --- |
| name (required) | Cluster name (must match the create name) |

| Option | Short | Description |
| --- | --- | --- |
| --admin-user | -u | SSH username (default: azureuser) |
| --key | -i | SSH key name from ~/.ssh/ |

What it does (per node):

| Step | Description |
| --- | --- |
| 1 | Update the Kubernetes apt repository to the target version |
| 2 | Unhold the kubelet, kubeadm, and kubectl packages |
| 3 | Upgrade all three packages to the latest patch of vK8S_VERSION |
| 4 | Re-hold the packages to prevent accidental upgrades |
| 5 | Run systemctl daemon-reload && systemctl restart kubelet |

Nodes are upgraded sequentially (master first, then workers). Pods are NOT drained — for zero-downtime upgrades, drain nodes manually before running this command.
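
For orientation, the per-node steps map onto standard apt and systemd commands roughly as sketched below. The repository line follows the upstream pkgs.k8s.io layout; the file path, keyring path, and helper names are assumptions, and the script only prints the commands rather than running them.

```shell
# Build the apt source line for a Kubernetes minor version (e.g. 1.32).
# Keyring path is the one used in the upstream install docs (assumed here).
k8s_apt_source() {
  printf 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v%s/deb/ /\n' "$1"
}

# Print the five per-node steps as shell commands, without executing them.
upgrade_plan() {
  echo "echo '$(k8s_apt_source "$1")' | sudo tee /etc/apt/sources.list.d/kubernetes.list"
  echo "sudo apt-mark unhold kubelet kubeadm kubectl"
  echo "sudo apt-get update && sudo apt-get install -y kubelet kubeadm kubectl"
  echo "sudo apt-mark hold kubelet kubeadm kubectl"
  echo "sudo systemctl daemon-reload && sudo systemctl restart kubelet"
}

upgrade_plan 1.32
```

Because pods are not drained, a manual `kubectl drain <node> --ignore-daemonsets` before the upgrade and `kubectl uncordon <node>` afterwards is one way to limit disruption on each node.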

caution

This command upgrades the kubelet binary and restarts it. It does not run kubeadm upgrade apply (the control plane component upgrade). For a full Kubernetes version upgrade, run kubeadm upgrade apply on the master node separately.

Example:

# Upgrade all nodes to the target K8s version
clouder kubeadm upgrade-kubelet my-cluster

# With a specific SSH key
clouder kubeadm upgrade-kubelet my-cluster --key my-cluster-key

Typical Workflow

Create a Cluster

Three commands are needed to go from zero to a running Kubernetes cluster:

# 1. Create the VMs (1 master + 3 workers on Azure)
clouder kubeadm create my-cluster

# 2. Set up Kubernetes on all nodes
clouder kubeadm setup my-cluster

# 3. Fetch the kubeconfig
clouder kubeadm get-config my-cluster

create provisions 4 VMs (Standard_B4ms, 100 GB OS disks) in Azure with a VNet, subnet, and NSG. setup installs containerd 2.x, CRIU, buildah, and kubeadm/kubelet/kubectl, initializes the control plane with Flannel CNI, joins the workers, enables the ContainerCheckpoint feature gate on every node, and installs the Azure Disk CSI and Azure File CSI drivers for dynamic persistent volume provisioning. get-config downloads the kubeconfig to ~/.clouder/kubeadm/my-cluster/kubeconfig.

Use the Cluster

# Run kubectl via the persisted kubeconfig
clouder kubectl my-cluster get nodes
clouder kubectl my-cluster get pods -A
clouder kubectl my-cluster apply -f deployment.yaml

# If a default kubeadm cluster is configured, name can be omitted
clouder kubectl get nodes

Add an Ingress Controller

Pick one of nginx or traefik. Both create an Azure Load Balancer with a public IP:

# Option A: ingress-nginx
clouder kubeadm enable-ingress-nginx my-cluster

# Option B: traefik
clouder kubeadm enable-ingress-traefik my-cluster

Validate with Smoke Test

The smoke test validates both ingress routing and CRIU checkpoint/restore end-to-end:

clouder kubeadm smoke-test my-cluster

Tear Down

# Remove the ingress controller and load balancer first
clouder kubeadm disable-ingress-nginx my-cluster

# Delete all VMs and Azure resources
clouder kubeadm terminate my-cluster

Networking Details

The setup command installs Flannel as the CNI with --pod-network-cidr=10.244.0.0/16. Flannel provides VXLAN-based overlay networking that works reliably on Azure VNets where all nodes share the same subnet.

note

Calico CNI code is preserved in the codebase (commented out in Kubeadm.py) for future use. Calico's default VXLANCrossSubnet mode uses direct routing when nodes are on the same subnet, which requires IP forwarding enabled on Azure NICs. Flannel's VXLAN overlay avoids this issue entirely.

CRIU Support Details

The setup command configures every node for CRIU checkpoint/restore:

  • containerd 2.x from Docker's official apt repository (supports CRI CheckpointContainer method)
  • SystemdCgroup enabled in containerd config
  • CRIU installed and verified (criu check)
  • buildah installed (converts checkpoint archives into OCI images for restore)
  • ContainerCheckpoint feature gate enabled on all kubelets
  • Kubelet API accessible on port 10250 (NSG rule)
caution

CRIU checkpoint requires containerd 2.0+. The Ubuntu apt containerd package provides 1.7.x which does NOT support checkpointing. Clouder's setup installs containerd.io from Docker's repo to get 2.x.

If your cluster was created with an older version of Clouder (containerd 1.7.x), you must recreate it:

clouder kubeadm terminate my-cluster
clouder kubeadm create my-cluster
clouder kubeadm setup my-cluster

An in-place upgrade from containerd 1.7 to 2.x is not supported.

Storage Details

The setup command installs the Azure Disk CSI driver and the Azure File CSI driver for dynamic persistent volume provisioning. The Disk CSI driver allows pods to request PersistentVolumeClaim resources backed by Azure Managed Disks. The File CSI driver enables NFS-based shared filesystem provisioning and setup creates the azure-nfs StorageClass with the cluster's Azure subscription, resource group, and location baked in (required on Kubeadm; on AKS these are auto-detected from IMDS).

What gets configured:

| Component | Description |
| --- | --- |
| /etc/kubernetes/azure.json | Azure cloud-provider config deployed to all nodes |
| azure-cloud-provider secret | Kubernetes secret in kube-system for the CSI controller |
| Azure Disk CSI driver v1.30.3 | Controller + node DaemonSet with snapshot support |
| Azure File CSI driver v1.30.6 | Controller + node DaemonSet for NFS file shares |
| managed-csi StorageClass | Default StorageClass using StandardSSD_LRS, WaitForFirstConsumer binding, volume expansion enabled |
| azure-nfs StorageClass | NFS StorageClass for shared filesystem (file.csi.azure.com, Premium_LRS) with Azure params |

Service principal:

The CSI driver requires an Azure service principal with Contributor access to the cluster's resource group. Clouder will:

  1. Use existing AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET environment variables if set.
  2. Otherwise, auto-create a new SP (clouder-<name>-csi) scoped to the resource group via az ad sp create-for-rbac.
  3. Skip the step with a warning if credentials cannot be obtained.

Generate operator Helm values JSON (recommended):

Use the Azure CLI helper to generate a Helm-ready JSON file containing all required Azure fields for:

  • operator.cloudCredentials.azure.tenantId
  • operator.cloudCredentials.azure.clientId
  • operator.cloudCredentials.azure.clientSecret
  • operator.cloudCredentials.azure.subscriptionId
  • operator.cloudCredentials.azure.resourceGroup
clouder azure helm-values --cluster my-cluster -o datalayer-operator-azure.json

If --output is omitted, the file is created at ~/.clouder/kubeadm/<cluster>/datalayer-operator-azure.json, which is the default path consumed by plane/datalayer_plane/sbin/up.sh for datalayer_operator.

Then pass it to Helm:

helm upgrade --install datalayer-operator \
  ./etc/helm/charts/datalayer-operator \
  --namespace datalayer-runtimes \
  --values datalayer-operator-azure.json

Manual installation (if Step 6 was skipped):

# 1. Create a service principal
az ad sp create-for-rbac --name clouder-<name>-csi \
  --role Contributor \
  --scopes /subscriptions/<sub-id>/resourceGroups/<rg>

# 2. Create /etc/kubernetes/azure.json on every node (see Azure Disk CSI docs)
# 3. Install the CSI drivers
curl -skSL https://raw.githubusercontent.com/kubernetes-sigs/azuredisk-csi-driver/v1.30.3/deploy/install-driver.sh | bash -s v1.30.3 snapshot --
curl -skSL https://raw.githubusercontent.com/kubernetes-sigs/azurefile-csi-driver/v1.30.6/deploy/install-driver.sh | bash -s v1.30.6 --

# 4. Create the StorageClasses
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azure-nfs
provisioner: file.csi.azure.com
allowVolumeExpansion: true
parameters:
  shareName: datalayer-shared-filesystem
  skuName: Premium_LRS
  protocol: nfs
  subscriptionID: "<sub-id>"
  resourceGroup: "<rg>"
  location: "<region>"
EOF

Use smoke-test to validate the full checkpoint/restore pipeline and ingress load balancer.

See the CRIU documentation for checkpoint and restore workflows.

tip

All VMs are on the same subnet (10.0.0.0/24), so Kubeadm nodes discover each other using private IPs. The setup command handles this automatically.

note

Supported cloud providers: Azure and AWS.


clouder kubectl

Run kubectl commands using the persisted kubeconfig for a cluster.

Full reference: clouder kubectl.

clouder kubectl [name] <kubectl-args...>

| Argument | Description |
| --- | --- |
| name (optional) | Cluster name. If omitted, Clouder uses the default kubeadm cluster. |
| kubectl-args | Any kubectl arguments (e.g. get nodes, apply -f ...) |

Example:

clouder kubectl my-cluster get nodes
clouder kubectl my-cluster get pods -A
clouder kubectl my-cluster apply -f deployment.yaml
clouder kubectl my-cluster logs my-pod

This is equivalent to running:

kubectl --kubeconfig="$HOME/.clouder/kubeadm/my-cluster/kubeconfig" get nodes

clouder helm

Run helm commands using the persisted kubeconfig for a cluster.

Full reference: clouder helm.

clouder helm <name> <helm-args...>

| Argument | Description |
| --- | --- |
| name (required) | Cluster name (kubeconfig must exist in ~/.clouder/kubeadm/<name>/) |
| helm-args | Any helm arguments (e.g. list -A, install ...) |

Example:

clouder helm my-cluster list -A
clouder helm my-cluster install my-release my-chart
clouder helm my-cluster upgrade my-release my-chart
clouder helm my-cluster uninstall my-release

This is equivalent to running:

helm --kubeconfig="$HOME/.clouder/kubeadm/my-cluster/kubeconfig" list -A