CRIU - Checkpoint/Restore In Userspace

Clouder integrates CRIU (Checkpoint/Restore In Userspace) with Kubernetes to enable snapshotting running pods and restoring them on the same or different nodes. This is the core differentiator of Clouder-managed clusters: the ability to freeze a running computation (e.g., a Jupyter notebook session, an AI training job) and resume it later, potentially on different hardware.

Overview

┌─────────────────────────────────────────────────┐
│                   Running Pod                   │
│ ┌─────────────────────────────────────────────┐ │
│ │ Container (e.g., Jupyter notebook)          │ │
│ │  - Process state (CPU registers, memory)    │ │
│ │  - Open files, sockets                      │ │
│ │  - Filesystem changes                       │ │
│ └─────────────────────────────────────────────┘ │
└────────────────────────┬────────────────────────┘
                         │
              ┌──────────▼──────────┐
              │   CRIU Checkpoint   │
              │    (kubelet API)    │
              └──────────┬──────────┘
                         │
              ┌──────────▼──────────┐
              │ Checkpoint Archive  │
              │       (.tar)        │
              │  - Memory pages     │
              │  - Process tree     │
              │  - File descriptors │
              │  - Network state    │
              └──────────┬──────────┘
                         │
            ┌────────────┼────────────┐
            ▼            ▼            ▼
      ┌──────────┐ ┌──────────┐ ┌──────────┐
      │  Local   │ │    S3    │ │  Azure   │
      │   Disk   │ │  Bucket  │ │   Blob   │
      └──────────┘ └──────────┘ └──────────┘
                         │
              ┌──────────▼──────────┐
              │    CRIU Restore     │
              │      (new Pod)      │
              └──────────┬──────────┘
                         │
              ┌──────────▼──────────┐
              │    Restored Pod     │
              │   (same state,      │
              │   different node)   │
              └─────────────────────┘

Why CRIU for Kubernetes?

Use Cases

| Use Case | Description |
| --- | --- |
| Jupyter session persistence | Freeze a running notebook with all variables in memory, resume later on a different machine |
| Cost optimization | Checkpoint GPU workloads, release expensive GPU VMs, restore when needed |
| Live migration | Move running pods between nodes without downtime |
| Faster startup | Checkpoint an initialized application, restore instead of cold-starting |
| Disaster recovery | Periodic checkpoints as recovery points for stateful workloads |
| Spot instance tolerance | Checkpoint before spot/preemptible VM eviction, restore on a new instance |

How CRIU Works

CRIU operates at the Linux process level:

  1. Freeze: Pauses all processes in a container using the cgroup freezer
  2. Dump: Serializes process state to disk — memory pages, CPU registers, file descriptors, signal handlers, IPC, timers, network sockets
  3. Restore: Recreates the process tree from the dump, restoring all state exactly as it was

Since Kubernetes 1.25, the Kubelet Checkpoint API (ContainerCheckpoint feature gate) exposes CRIU through a standardized interface.

Prerequisites

Clouder clusters created with clouder kubeadm setup automatically configure all prerequisites:

  • Linux kernel ≥ 5.4 (Ubuntu 22.04+ / Debian 12+)
  • CRIU ≥ 3.17 installed on all nodes
  • containerd 2.x from Docker's official apt repository (supports CRI CheckpointContainer method)
  • Kubelet with ContainerCheckpoint=true feature gate
  • Kubernetes ≥ 1.25

To verify CRIU is working on a node:

# SSH into a cluster node
ssh azureuser@<node-ip>

# Check CRIU
criu check
# Should output: "Looks good."

# Check kubelet feature gate
grep ContainerCheckpoint /var/lib/kubelet/config.yaml
# Should show: ContainerCheckpoint: true

Checkpoint Workflow

Step 1: Identify the Pod

# List running pods
kubectl get pods -n <namespace>

# Get pod details (note the container name and node)
kubectl describe pod <pod-name> -n <namespace>

Step 2: Create a Checkpoint

Use the Kubelet Checkpoint API:

# Checkpoint a container via the kubelet API
curl -X POST \
"https://<node-ip>:10250/checkpoint/<namespace>/<pod-name>/<container-name>" \
--cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
--key /etc/kubernetes/pki/apiserver-kubelet-client.key \
--cacert /etc/kubernetes/pki/ca.crt
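On success, the kubelet responds with a small JSON body whose items field lists the archive path(s) it wrote under /var/lib/kubelet/checkpoints/. A minimal parsing sketch (the helper name and the example path are illustrative, not part of the API):

```python
import json

def checkpoint_archives(body: str) -> list:
    """Return the archive paths listed in the kubelet's checkpoint reply."""
    return json.loads(body).get("items", [])

# Illustrative reply; actual archive names are generated by the kubelet.
reply = '{"items": ["/var/lib/kubelet/checkpoints/checkpoint-web_default-app.tar"]}'
print(checkpoint_archives(reply))
```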

Or using the Clouder CLI:

# Checkpoint a pod (interactive container selection if multiple)
clouder criu checkpoint <pod-name> --namespace <namespace>

# Checkpoint with a label
clouder criu checkpoint <pod-name> \
--namespace <namespace> \
--label "before-upgrade"

# Checkpoint and store to S3
clouder criu checkpoint <pod-name> \
--namespace <namespace> \
--storage s3://my-bucket/checkpoints/

Step 3: Verify the Checkpoint

# List checkpoints for a pod
clouder criu ls --pod <pod-name>

# Inspect a checkpoint
clouder criu inspect <checkpoint-id>

Checkpoint Storage

Checkpoints are stored as tar archives. Clouder supports multiple storage backends:

| Backend | Configuration | Best For |
| --- | --- | --- |
| Local disk | /var/lib/kubelet/checkpoints/ | Development, single-node |
| S3 | s3://&lt;bucket&gt;/checkpoints/ | Production, multi-cloud |
| Azure Blob | az://&lt;container&gt;/checkpoints/ | Azure-native deployments |
| NFS | /mnt/nfs/checkpoints/ | Shared storage clusters |
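The URI scheme selects the backend. A hypothetical resolver sketching how that routing could work (the function name and mapping are illustrative, not Clouder's actual code):

```python
from urllib.parse import urlparse

def backend_for(uri: str) -> str:
    """Map a checkpoint storage URI to a backend name (illustrative only)."""
    scheme = urlparse(uri).scheme
    if scheme == "s3":
        return "s3"
    if scheme == "az":
        return "azure-blob"
    if uri.startswith("/mnt/nfs/"):
        return "nfs"
    if scheme == "" and uri.startswith("/"):
        return "local"
    raise ValueError(f"unsupported storage URI: {uri}")

print(backend_for("s3://my-bucket/checkpoints/"))    # s3
print(backend_for("/var/lib/kubelet/checkpoints/"))  # local
```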

Configure the default storage backend:

# Set default checkpoint storage
clouder criu configure --storage s3://my-bucket/checkpoints/

Restore Workflow

Step 1: List Available Checkpoints

# List all checkpoints
clouder criu ls

# Filter by pod name or label
clouder criu ls --pod jupyter-abc123
clouder criu ls --label "before-upgrade"

Step 2: Restore a Checkpoint

# Restore to a new pod (same node)
clouder criu restore <checkpoint-id>

# Restore to a specific node
clouder criu restore <checkpoint-id> --node <node-name>

# Restore to a different namespace
clouder criu restore <checkpoint-id> --namespace <target-namespace>

# Restore with a new pod name
clouder criu restore <checkpoint-id> --name <new-pod-name>

Step 3: Verify the Restore

# The restored pod should be running with the same state
kubectl get pods -n <namespace>
kubectl logs <restored-pod-name> -n <namespace>

Technical Details

What Gets Checkpointed

| Component | Included | Notes |
| --- | --- | --- |
| Process memory | ✅ | All heap and stack pages |
| CPU registers | ✅ | Including FPU state |
| Open files | ✅ | Regular files, pipes |
| File locks | ✅ | flock, fcntl |
| Signal handlers | ✅ | Signal masks and pending signals |
| Timers | ✅ | POSIX timers, itimerval |
| IPC | ✅ | SysV IPC, POSIX MQ |
| TCP connections | ⚠️ | Requires --tcp-established |
| GPU state | ❌ | Future work (see Roadmap) |
| Persistent volumes | ❌ | Must be reattached on restore |

Checkpoint Size Estimation

The checkpoint archive size depends on the container's memory footprint:

Checkpoint size ≈ RSS (Resident Set Size) + metadata overhead (~5%)

For example:

  • Jupyter notebook with 2 GB data loaded → ~2.1 GB checkpoint
  • AI training job with 16 GB model → ~16.8 GB checkpoint
  • Web server with 200 MB memory → ~210 MB checkpoint
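The rule of thumb above can be expressed directly (a sketch, not a guarantee — actual archives vary with page compressibility and metadata):

```python
def estimate_checkpoint_bytes(rss_bytes: int, overhead: float = 0.05) -> int:
    """Checkpoint size ~= RSS plus roughly 5% metadata overhead."""
    return int(rss_bytes * (1 + overhead))

GB = 1024 ** 3
print(round(estimate_checkpoint_bytes(2 * GB) / GB, 2))   # 2.1
print(round(estimate_checkpoint_bytes(16 * GB) / GB, 2))  # 16.8
```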

Performance

| Operation | Typical Duration | Notes |
| --- | --- | --- |
| Checkpoint (1 GB) | 2-5 seconds | Depends on disk I/O |
| Checkpoint (16 GB) | 15-40 seconds | Memory-bound |
| Restore (1 GB) | 1-3 seconds | Faster than checkpoint |
| Archive to S3 (1 GB) | 5-15 seconds | Network-bound |
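The S3 row is network-bound, so it can be sanity-checked with a back-of-envelope transfer estimate (the 1 Gbit/s figure below is an assumed effective uplink, not a measurement):

```python
def transfer_seconds(size_gb: float, bandwidth_gbit_s: float) -> float:
    """Back-of-envelope transfer time: 8 bits per byte over the link speed."""
    return size_gb * 8 / bandwidth_gbit_s

# A 1 GB archive over an assumed ~1 Gbit/s effective uplink:
print(transfer_seconds(1, 1))  # 8.0 seconds, within the 5-15 s row above
```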

Limitations

  1. GPU state: CRIU cannot checkpoint GPU memory or CUDA contexts. GPU workloads must save model state to disk before checkpointing.
  2. Network connections: TCP connections can be checkpointed with --tcp-established, but the remote end must still be listening on restore.
  3. Kernel version: Checkpoint and restore must happen on the same kernel version.
  4. containerd version: The checkpoint format may differ between container runtime versions.
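Because of limitation 1, GPU workloads should persist their own state to a volume before a checkpoint is taken. A minimal sketch of such a pre-checkpoint save, using pickle as a stand-in for a framework-specific saver (e.g. torch.save); the function name is ours:

```python
import os
import pickle

def save_state_for_checkpoint(state: dict, path: str) -> None:
    """Persist in-memory state that CRIU cannot capture (e.g. GPU memory).

    Write-then-rename keeps the file consistent even if the freeze
    lands in the middle of the save.
    """
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)
```

A pre-checkpoint hook would call this, then let the checkpoint proceed; on restore, the workload reloads the file and re-initializes its CUDA context.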

Jupyter Notebook Use Case

The primary Clouder use case is checkpointing Jupyter notebook sessions:

# 1. User is running a Jupyter notebook with loaded data
# Pod: jupyter-user-abc123, Container: notebook

# 2. User wants to stop working but preserve state
clouder criu checkpoint jupyter-user-abc123 \
--namespace jupyter \
--storage s3://clouder-checkpoints/ \
--label "end-of-day"

# 3. Later (hours, days), restore the session
clouder criu restore <checkpoint-id> \
--namespace jupyter \
--node gpu-worker-1 # Optionally on a different node

# 4. User reconnects to the notebook — all variables, imports,
# and computation state are exactly as they left them

Benefits for Jupyter Users

  • No re-execution: No need to re-run cells to rebuild state
  • Hardware flexibility: Checkpoint on CPU, restore on GPU (for inference)
  • Cost savings: Release expensive GPU VMs when not actively computing
  • Session persistence: Survive node failures and cluster maintenance

Clouder Operator Integration

The Clouder Kubernetes operator can automate checkpointing:

# Example CRD for automated checkpointing
apiVersion: clouder.datalayer.io/v1
kind: CheckpointPolicy
metadata:
  name: jupyter-auto-checkpoint
spec:
  selector:
    matchLabels:
      app: jupyter
  schedule: "*/30 * * * *"  # Every 30 minutes
  storage:
    type: s3
    bucket: clouder-checkpoints
    prefix: auto/
  retention:
    maxCheckpoints: 5
    maxAge: 7d
  triggers:
    - type: idle  # Checkpoint when pod is idle
      idleTimeout: 15m
    - type: preemption  # Checkpoint before spot eviction
CLI Reference

# Checkpoint operations
clouder criu checkpoint <pod> [--namespace ns] [--storage path] [--label label]
clouder criu ls [--pod name] [--label label] [--storage path]
clouder criu inspect <checkpoint-id>
clouder criu delete <checkpoint-id>

# Restore operations
clouder criu restore <checkpoint-id> [--node node] [--namespace ns] [--name new-name]

# Configuration
clouder criu configure [--storage default-path]
clouder criu status # Show CRIU status on all nodes

Roadmap

  • Phase 1: Basic checkpoint/restore via kubelet API (single node)
  • Phase 2: Cross-node restore with S3 checkpoint storage
  • Phase 3: Automated checkpoint policies via CRD/operator
  • Phase 4: Jupyter-specific integration (pre/post hooks for clean checkpoint)
  • Phase 5: GPU state preservation (CUDA checkpoint research)
  • Phase 6: Incremental checkpoints (only changed memory pages)