CRIU - Checkpoint/Restore In Userspace
Clouder integrates CRIU (Checkpoint/Restore In Userspace) with Kubernetes to enable snapshotting running pods and restoring them on the same or different nodes. This is the core differentiator of Clouder-managed clusters: the ability to freeze a running computation (e.g., a Jupyter notebook session, an AI training job) and resume it later, potentially on different hardware.
Overview
┌─────────────────────────────────────────────────────┐
│                     Running Pod                     │
│  ┌─────────────────────────────────────────────┐    │
│  │ Container (e.g., Jupyter notebook)          │    │
│  │ - Process state (CPU registers, memory)     │    │
│  │ - Open files, sockets                       │    │
│  │ - Filesystem changes                        │    │
│  └─────────────────────────────────────────────┘    │
└──────────────────────┬──────────────────────────────┘
                       │
            ┌──────────▼──────────┐
            │   CRIU Checkpoint   │
            │    (kubelet API)    │
            └──────────┬──────────┘
                       │
            ┌──────────▼──────────┐
            │ Checkpoint Archive  │
            │       (.tar)        │
            │ - Memory pages      │
            │ - Process tree      │
            │ - File descriptors  │
            │ - Network state     │
            └──────────┬──────────┘
                       │
          ┌────────────┼────────────┐
          ▼            ▼            ▼
     ┌──────────┐ ┌──────────┐ ┌──────────┐
     │  Local   │ │    S3    │ │  Azure   │
     │   Disk   │ │  Bucket  │ │   Blob   │
     └──────────┘ └──────────┘ └──────────┘
                       │
            ┌──────────▼──────────┐
            │    CRIU Restore     │
            │      (new Pod)      │
            └──────────┬──────────┘
                       │
            ┌──────────▼──────────┐
            │    Restored Pod     │
            │    (same state,     │
            │   different node)   │
            └─────────────────────┘
Why CRIU for Kubernetes?
Use Cases
| Use Case | Description |
|---|---|
| Jupyter session persistence | Freeze a running notebook with all variables in memory, resume later on a different machine |
| Cost optimization | Checkpoint GPU workloads, release expensive GPU VMs, restore when needed |
| Live migration | Move running pods between nodes without downtime |
| Faster startup | Checkpoint an initialized application, restore instead of cold-starting |
| Disaster recovery | Periodic checkpoints as recovery points for stateful workloads |
| Spot instance tolerance | Checkpoint before spot/preemptible VM eviction, restore on a new instance |
How CRIU Works
CRIU operates at the Linux process level:
- Freeze: Pauses all processes in a container using the cgroup freezer
- Dump: Serializes process state to disk — memory pages, CPU registers, file descriptors, signal handlers, IPC, timers, network sockets
- Restore: Recreates the process tree from the dump, restoring all state exactly as it was
Since Kubernetes 1.25, the Kubelet Checkpoint API (ContainerCheckpoint feature gate) exposes CRIU through a standardized interface.
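The endpoint the feature gate unlocks has a predictable shape: `POST /checkpoint/{namespace}/{pod}/{container}` on the kubelet's secure port (10250). A small helper sketches it (the helper name, node address, and pod/container names here are illustrative):

```shell
# Build the kubelet checkpoint endpoint URL for a given container.
# checkpoint_url is our own helper; the path shape follows the
# Kubelet Checkpoint API: POST /checkpoint/{namespace}/{pod}/{container}.
checkpoint_url() {
  local node_ip="$1" namespace="$2" pod="$3" container="$4"
  printf 'https://%s:10250/checkpoint/%s/%s/%s\n' \
    "$node_ip" "$namespace" "$pod" "$container"
}

# Example with illustrative values:
checkpoint_url 10.0.0.4 jupyter jupyter-user-abc123 notebook
# → https://10.0.0.4:10250/checkpoint/jupyter/jupyter-user-abc123/notebook
```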
Prerequisites
Clouder clusters created with clouder kubeadm setup automatically configure all prerequisites:
- Linux kernel ≥ 5.4 (Ubuntu 22.04+ / Debian 12+)
- CRIU ≥ 3.17 installed on all nodes
- containerd 2.x from Docker's official apt repository (supports the CRI CheckpointContainer method)
- Kubelet with the ContainerCheckpoint=true feature gate
- Kubernetes ≥ 1.25
To verify CRIU is working on a node:
# SSH into a cluster node
ssh azureuser@<node-ip>
# Check CRIU
criu check
# Should output: "Looks good."
# Check the kubelet feature gate
grep ContainerCheckpoint /var/lib/kubelet/config.yaml
# Should show: ContainerCheckpoint: true
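The same checks can be scripted across nodes. The sketch below uses a `version_ge` helper built on `sort -V` (the helper name is ours, and parsing `criu --version` assumes its usual `Version: X.Y` first line):

```shell
# version_ge A B: succeed if dotted version A >= version B.
# GNU sort -V does the version-aware ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Kernel check (always available):
kernel=$(uname -r | cut -d- -f1)
version_ge "$kernel" 5.4 && echo "kernel $kernel: OK" || echo "kernel $kernel: too old"

# CRIU check, guarded so the script also runs on nodes without criu:
if command -v criu >/dev/null 2>&1; then
  ver=$(criu --version | awk 'NR==1 {print $2}')
  version_ge "$ver" 3.17 && echo "criu $ver: OK" || echo "criu $ver: too old"
fi
```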
Checkpoint Workflow
Step 1: Identify the Pod
# List running pods
kubectl get pods -n <namespace>
# Get pod details (note the container name and node)
kubectl describe pod <pod-name> -n <namespace>
Step 2: Create a Checkpoint
Use the Kubelet Checkpoint API:
# Checkpoint a container via the kubelet API
curl -X POST \
"https://<node-ip>:10250/checkpoint/<namespace>/<pod-name>/<container-name>" \
--cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
--key /etc/kubernetes/pki/apiserver-kubelet-client.key \
--cacert /etc/kubernetes/pki/ca.crt
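On success the kubelet replies with a small JSON document naming the archive it wrote under /var/lib/kubelet/checkpoints/. A sketch of extracting that path (the exact filename below is illustrative):

```shell
# Sample response body from the kubelet checkpoint API (filename illustrative):
resp='{"items":["/var/lib/kubelet/checkpoints/checkpoint-jupyter-user-abc123_jupyter-notebook-2024-01-01T00:00:00Z.tar"]}'

# Extract the archive path with sed; with jq installed,
# `jq -r '.items[0]'` does the same more robustly.
archive=$(printf '%s' "$resp" | sed -n 's/.*\["\([^"]*\)"\].*/\1/p')
echo "$archive"
```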
Or using the Clouder CLI:
# Checkpoint a pod (interactive container selection if multiple)
clouder criu checkpoint <pod-name> --namespace <namespace>
# Checkpoint with a label
clouder criu checkpoint <pod-name> \
--namespace <namespace> \
--label "before-upgrade"
# Checkpoint and store to S3
clouder criu checkpoint <pod-name> \
--namespace <namespace> \
--storage s3://my-bucket/checkpoints/
Step 3: Verify the Checkpoint
# List checkpoints for a pod
clouder criu ls --pod <pod-name>
# Inspect a checkpoint
clouder criu inspect <checkpoint-id>
Checkpoint Storage
Checkpoints are stored as tar archives. Clouder supports multiple storage backends:
| Backend | Configuration | Best For |
|---|---|---|
| Local disk | /var/lib/kubelet/checkpoints/ | Development, single-node |
| S3 | s3://<bucket>/checkpoints/ | Production, multi-cloud |
| Azure Blob | az://<container>/checkpoints/ | Azure-native deployments |
| NFS | /mnt/nfs/checkpoints/ | Shared storage clusters |
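How a backend might be inferred from a path's scheme, as a sketch (the mapping below is our assumption about how such paths would be dispatched, not Clouder's actual implementation):

```shell
# Classify a checkpoint storage path by its scheme/prefix.
# storage_backend is an illustrative helper, not part of the Clouder CLI.
storage_backend() {
  case "$1" in
    s3://*)     echo s3 ;;
    az://*)     echo azure-blob ;;
    /mnt/nfs/*) echo nfs ;;
    /*)         echo local ;;
    *)          echo unknown ;;
  esac
}

storage_backend s3://my-bucket/checkpoints/   # → s3
storage_backend /var/lib/kubelet/checkpoints/ # → local
```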
Configure the default storage backend:
# Set default checkpoint storage
clouder criu configure --storage s3://my-bucket/checkpoints/
Restore Workflow
Step 1: List Available Checkpoints
# List all checkpoints
clouder criu ls
# Filter by pod name or label
clouder criu ls --pod jupyter-abc123
clouder criu ls --label "before-upgrade"
Step 2: Restore a Checkpoint
# Restore to a new pod (same node)
clouder criu restore <checkpoint-id>
# Restore to a specific node
clouder criu restore <checkpoint-id> --node <node-name>
# Restore to a different namespace
clouder criu restore <checkpoint-id> --namespace <target-namespace>
# Restore with a new pod name
clouder criu restore <checkpoint-id> --name <new-pod-name>
Step 3: Verify the Restore
# The restored pod should be running with the same state
kubectl get pods -n <namespace>
kubectl logs <restored-pod-name> -n <namespace>
Technical Details
What Gets Checkpointed
| Component | Included | Notes |
|---|---|---|
| Process memory | ✅ | All heap and stack pages |
| CPU registers | ✅ | Including FPU state |
| Open files | ✅ | Regular files, pipes |
| File locks | ✅ | flock, fcntl |
| Signal handlers | ✅ | Signal masks and pending signals |
| Timers | ✅ | POSIX timers, itimerval |
| IPC | ✅ | SysV IPC, POSIX MQ |
| TCP connections | ⚠️ | Requires --tcp-established |
| GPU state | ❌ | Future work (see Roadmap) |
| Persistent volumes | ❌ | Must be reattached on restore |
Checkpoint Size Estimation
The checkpoint archive size depends on the container's memory footprint:
Checkpoint size ≈ RSS (Resident Set Size) + metadata overhead (~5%)
For example:
- Jupyter notebook with 2 GB data loaded → ~2.1 GB checkpoint
- AI training job with 16 GB model → ~16.8 GB checkpoint
- Web server with 200 MB memory → ~210 MB checkpoint
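That rule of thumb is easy to script: add ~5% overhead to the RSS. The helper names below are ours; the `/proc` variant reads VmRSS for a live process on Linux:

```shell
# +5% metadata overhead on a known RSS, in KB (integer arithmetic: /20 = 5%).
estimate_kb() { echo $(( $1 + $1 / 20 )); }

# Read VmRSS from /proc for a live PID and estimate from that.
estimate_checkpoint_kb() {
  local rss_kb
  rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$1/status")
  estimate_kb "$rss_kb"
}

estimate_kb $((2 * 1024 * 1024))   # 2 GB RSS → 2202009 KB ≈ 2.1 GB
```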
Performance
| Operation | Typical Duration | Notes |
|---|---|---|
| Checkpoint (1 GB) | 2-5 seconds | Depends on disk I/O |
| Checkpoint (16 GB) | 15-40 seconds | Memory-bound |
| Restore (1 GB) | 1-3 seconds | Faster than checkpoint |
| Archive to S3 (1 GB) | 5-15 seconds | Network-bound |
Limitations
- GPU state: CRIU cannot checkpoint GPU memory or CUDA contexts. GPU workloads must save model state to disk before checkpointing.
- Network connections: TCP connections can be checkpointed with --tcp-established, but the remote end must still be listening on restore.
- Kernel version: Checkpoint and restore must happen on the same kernel version.
- containerd version: The checkpoint archive format may differ between container runtime versions.
Jupyter Notebook Use Case
The primary Clouder use case is checkpointing Jupyter notebook sessions:
# 1. User is running a Jupyter notebook with loaded data
# Pod: jupyter-user-abc123, Container: notebook
# 2. User wants to stop working but preserve state
clouder criu checkpoint jupyter-user-abc123 \
--namespace jupyter \
--storage s3://clouder-checkpoints/ \
--label "end-of-day"
# 3. Later (hours, days), restore the session
clouder criu restore <checkpoint-id> \
--namespace jupyter \
--node gpu-worker-1 # Optionally on a different node
# 4. User reconnects to the notebook — all variables, imports,
# and computation state are exactly as they left them
Benefits for Jupyter Users
- No re-execution: No need to re-run cells to rebuild state
- Hardware flexibility: Checkpoint on CPU, restore on GPU (for inference)
- Cost savings: Release expensive GPU VMs when not actively computing
- Session persistence: Survive node failures and cluster maintenance
Clouder Operator Integration
The Clouder Kubernetes operator can automate checkpointing:
# Example CRD for automated checkpointing
apiVersion: clouder.datalayer.io/v1
kind: CheckpointPolicy
metadata:
name: jupyter-auto-checkpoint
spec:
selector:
matchLabels:
app: jupyter
schedule: "*/30 * * * *" # Every 30 minutes
storage:
type: s3
bucket: clouder-checkpoints
prefix: auto/
retention:
maxCheckpoints: 5
maxAge: 7d
triggers:
- type: idle # Checkpoint when pod is idle
idleTimeout: 15m
- type: preemption # Checkpoint before spot eviction
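The maxCheckpoints retention rule amounts to keeping the newest N archives and deleting the rest. A minimal sketch against a local checkpoint directory (the function name and directory layout are illustrative, not the operator's actual code):

```shell
# Keep the newest $2 *.tar archives in directory $1, delete the rest.
# ls -1t sorts newest-first; tail -n +(keep+1) selects everything beyond N.
prune_checkpoints() {
  local dir="$1" keep="$2"
  ls -1t "$dir"/*.tar 2>/dev/null | tail -n +"$((keep + 1))" | xargs -r rm -f
}
```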
CLI Reference
# Checkpoint operations
clouder criu checkpoint <pod> [--namespace ns] [--storage path] [--label label]
clouder criu ls [--pod name] [--label label] [--storage path]
clouder criu inspect <checkpoint-id>
clouder criu delete <checkpoint-id>
# Restore operations
clouder criu restore <checkpoint-id> [--node node] [--namespace ns] [--name new-name]
# Configuration
clouder criu configure [--storage default-path]
clouder criu status # Show CRIU status on all nodes
Roadmap
- Phase 1: Basic checkpoint/restore via kubelet API (single node)
- Phase 2: Cross-node restore with S3 checkpoint storage
- Phase 3: Automated checkpoint policies via CRD/operator
- Phase 4: Jupyter-specific integration (pre/post hooks for clean checkpoint)
- Phase 5: GPU state preservation (CUDA checkpoint research)
- Phase 6: Incremental checkpoints (only changed memory pages)