CRIU - Checkpoint/Restore In Userspace

Clouder integrates CRIU (Checkpoint/Restore In Userspace) with Kubernetes to enable snapshotting running pods and restoring them on the same or different nodes. This is the core differentiator of Clouder-managed clusters: the ability to freeze a running computation (e.g., a Jupyter notebook session, an AI training job) and resume it later, potentially on different hardware.

Overview

┌─────────────────────────────────────────────────┐
│                   Running Pod                   │
│ ┌─────────────────────────────────────────────┐ │
│ │ Container (e.g., Jupyter notebook)          │ │
│ │  - Process state (CPU registers, memory)    │ │
│ │  - Open files, sockets                      │ │
│ │  - Filesystem changes                       │ │
│ └─────────────────────────────────────────────┘ │
└────────────────────────┬────────────────────────┘
                         │
              ┌──────────▼──────────┐
              │   CRIU Checkpoint   │
              │    (kubelet API)    │
              └──────────┬──────────┘
                         │
              ┌──────────▼──────────┐
              │ Checkpoint Archive  │
              │       (.tar)        │
              │  - Memory pages     │
              │  - Process tree     │
              │  - File descriptors │
              │  - Network state    │
              └──────────┬──────────┘
                         │
            ┌────────────┼────────────┐
            ▼            ▼            ▼
      ┌──────────┐ ┌──────────┐ ┌──────────┐
      │  Local   │ │    S3    │ │  Azure   │
      │   Disk   │ │  Bucket  │ │   Blob   │
      └──────────┘ └──────────┘ └──────────┘
                         │
              ┌──────────▼──────────┐
              │    CRIU Restore     │
              │      (new Pod)      │
              └──────────┬──────────┘
                         │
              ┌──────────▼──────────┐
              │    Restored Pod     │
              │   (same state,      │
              │   different node)   │
              └─────────────────────┘

Why CRIU for Kubernetes?

Use Cases

| Use Case | Description |
| --- | --- |
| Jupyter session persistence | Freeze a running notebook with all variables in memory, resume later on a different machine |
| Cost optimization | Checkpoint GPU workloads, release expensive GPU VMs, restore when needed |
| Live migration | Move running pods between nodes without downtime |
| Faster startup | Checkpoint an initialized application, restore instead of cold-starting |
| Disaster recovery | Periodic checkpoints as recovery points for stateful workloads |
| Spot instance tolerance | Checkpoint before spot/preemptible VM eviction, restore on a new instance |

How CRIU Works

CRIU operates at the Linux process level:

  1. Freeze: Pauses all processes in a container using the cgroup freezer
  2. Dump: Serializes process state to disk — memory pages, CPU registers, file descriptors, signal handlers, IPC, timers, network sockets
  3. Restore: Recreates the process tree from the dump, restoring all state exactly as it was

Since Kubernetes 1.25, the Kubelet Checkpoint API (ContainerCheckpoint feature gate) exposes CRIU through a standardized interface.

Prerequisites

Clouder clusters created with clouder kubeadm setup automatically configure all prerequisites:

  • Linux kernel ≥ 5.4 (Ubuntu 22.04+ / Debian 12+)
  • CRIU ≥ 3.17 installed on all nodes
  • containerd 2.x from Docker's official apt repository (supports CRI CheckpointContainer method)
  • Kubelet with ContainerCheckpoint=true feature gate
  • Kubernetes ≥ 1.25

To verify CRIU is working on a node:

# SSH into a cluster node
ssh azureuser@<node-ip>

# Check CRIU
criu check
# Should output: "Looks good."

# Check kubelet feature gate
grep ContainerCheckpoint /var/lib/kubelet/config.yaml
# Should show: ContainerCheckpoint: true

Checkpoint Workflow

Step 1: Identify the Pod

# List running pods
kubectl get pods -n <namespace>

# Get pod details (note the container name and node)
kubectl describe pod <pod-name> -n <namespace>

Step 2: Create a Checkpoint

Use the Kubelet Checkpoint API:

# Checkpoint a container via the kubelet API
curl -X POST \
"https://<node-ip>:10250/checkpoint/<namespace>/<pod-name>/<container-name>" \
--cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
--key /etc/kubernetes/pki/apiserver-kubelet-client.key \
--cacert /etc/kubernetes/pki/ca.crt
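On success, the kubelet responds with a small JSON body whose items field lists the archive path(s) it wrote under /var/lib/kubelet/checkpoints/. A minimal parsing sketch (the helper name and the example path are illustrative, not part of the API):

```python
import json

def checkpoint_archives(body: str) -> list:
    """Return the archive paths listed in the kubelet's checkpoint reply."""
    return json.loads(body).get("items", [])

# Illustrative reply; actual archive names are generated by the kubelet.
reply = '{"items": ["/var/lib/kubelet/checkpoints/checkpoint-web_default-app.tar"]}'
print(checkpoint_archives(reply))
```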

Or using the Clouder CLI:

# Checkpoint a pod (interactive container selection if multiple)
clouder criu checkpoint <pod-name> --namespace <namespace>

# Checkpoint with a label
clouder criu checkpoint <pod-name> \
--namespace <namespace> \
--label "before-upgrade"

# Checkpoint and store to S3
clouder criu checkpoint <pod-name> \
--namespace <namespace> \
--storage s3://my-bucket/checkpoints/

Step 3: Verify the Checkpoint

# List checkpoints for a pod
clouder criu ls --pod <pod-name>

# Inspect a checkpoint
clouder criu inspect <checkpoint-id>

Checkpoint Storage

Checkpoints are stored as tar archives. Clouder supports multiple storage backends:

| Backend | Configuration | Best For |
| --- | --- | --- |
| Local disk | /var/lib/kubelet/checkpoints/ | Development, single-node |
| S3 | s3://&lt;bucket&gt;/checkpoints/ | Production, multi-cloud |
| Azure Blob | az://&lt;container&gt;/checkpoints/ | Azure-native deployments |
| NFS | /mnt/nfs/checkpoints/ | Shared storage clusters |
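The URI scheme selects the backend. A hypothetical resolver sketching how that routing could work (the function name and mapping are illustrative, not Clouder's actual code):

```python
from urllib.parse import urlparse

def backend_for(uri: str) -> str:
    """Map a checkpoint storage URI to a backend name (illustrative only)."""
    scheme = urlparse(uri).scheme
    if scheme == "s3":
        return "s3"
    if scheme == "az":
        return "azure-blob"
    if uri.startswith("/mnt/nfs/"):
        return "nfs"
    if scheme == "" and uri.startswith("/"):
        return "local"
    raise ValueError(f"unsupported storage URI: {uri}")

print(backend_for("s3://my-bucket/checkpoints/"))    # s3
print(backend_for("/var/lib/kubelet/checkpoints/"))  # local
```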

Configure the default storage backend:

# Set default checkpoint storage
clouder criu configure --storage s3://my-bucket/checkpoints/

Restore Workflow

Step 1: List Available Checkpoints

# List all checkpoints
clouder criu ls

# Filter by pod name or label
clouder criu ls --pod jupyter-abc123
clouder criu ls --label "before-upgrade"

Step 2: Restore a Checkpoint

# Restore to a new pod (same node)
clouder criu restore <checkpoint-id>

# Restore to a specific node
clouder criu restore <checkpoint-id> --node <node-name>

# Restore to a different namespace
clouder criu restore <checkpoint-id> --namespace <target-namespace>

# Restore with a new pod name
clouder criu restore <checkpoint-id> --name <new-pod-name>

Step 3: Verify the Restore

# The restored pod should be running with the same state
kubectl get pods -n <namespace>
kubectl logs <restored-pod-name> -n <namespace>

Technical Details

What Gets Checkpointed

| Component | Included | Notes |
| --- | --- | --- |
| Process memory | ✅ | All heap and stack pages |
| CPU registers | ✅ | Including FPU state |
| Open files | ✅ | Regular files, pipes |
| File locks | ✅ | flock, fcntl |
| Signal handlers | ✅ | Signal masks and pending signals |
| Timers | ✅ | POSIX timers, itimerval |
| IPC | ✅ | SysV IPC, POSIX MQ |
| TCP connections | ⚠️ | Requires --tcp-established |
| GPU state | ❌ | Future work (see Roadmap) |
| Persistent volumes | ❌ | Must be reattached on restore |

Checkpoint Size Estimation

The checkpoint archive size depends on the container's memory footprint:

Checkpoint size ≈ RSS (Resident Set Size) + metadata overhead (~5%)

For example:

  • Jupyter notebook with 2 GB data loaded → ~2.1 GB checkpoint
  • AI training job with 16 GB model → ~16.8 GB checkpoint
  • Web server with 200 MB memory → ~210 MB checkpoint
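The rule of thumb above can be expressed directly (a sketch, not a guarantee — actual archives vary with page compressibility and metadata):

```python
def estimate_checkpoint_bytes(rss_bytes: int, overhead: float = 0.05) -> int:
    """Checkpoint size ~= RSS plus roughly 5% metadata overhead."""
    return int(rss_bytes * (1 + overhead))

GB = 1024 ** 3
print(round(estimate_checkpoint_bytes(2 * GB) / GB, 2))   # 2.1
print(round(estimate_checkpoint_bytes(16 * GB) / GB, 2))  # 16.8
```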

Performance

| Operation | Typical Duration | Notes |
| --- | --- | --- |
| Checkpoint (1 GB) | 2-5 seconds | Depends on disk I/O |
| Checkpoint (16 GB) | 15-40 seconds | Memory-bound |
| Restore (1 GB) | 1-3 seconds | Faster than checkpoint |
| Archive to S3 (1 GB) | 5-15 seconds | Network-bound |
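The S3 row is network-bound, so it can be sanity-checked with a back-of-envelope transfer estimate (the 1 Gbit/s figure below is an assumed effective uplink, not a measurement):

```python
def transfer_seconds(size_gb: float, bandwidth_gbit_s: float) -> float:
    """Back-of-envelope transfer time: 8 bits per byte over the link speed."""
    return size_gb * 8 / bandwidth_gbit_s

# A 1 GB archive over an assumed ~1 Gbit/s effective uplink:
print(transfer_seconds(1, 1))  # 8.0 seconds, within the 5-15 s row above
```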

Limitations

  1. GPU state: CRIU cannot checkpoint GPU memory or CUDA contexts. GPU workloads must save model state to disk before checkpointing.
  2. Network connections: TCP connections can be checkpointed with --tcp-established, but the remote end must still be listening on restore.
  3. Kernel version: Checkpoint and restore must happen on the same kernel version.
  4. containerd version: The checkpoint format may differ between container runtime versions.
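Because of limitation 1, GPU workloads should persist their own state to a volume before a checkpoint is taken. A minimal sketch of such a pre-checkpoint save, using pickle as a stand-in for a framework-specific saver (e.g. torch.save); the function name is ours:

```python
import os
import pickle

def save_state_for_checkpoint(state: dict, path: str) -> None:
    """Persist in-memory state that CRIU cannot capture (e.g. GPU memory).

    Write-then-rename keeps the file consistent even if the freeze
    lands in the middle of the save.
    """
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)
```

A pre-checkpoint hook would call this, then let the checkpoint proceed; on restore, the workload reloads the file and re-initializes its CUDA context.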

Jupyter Notebook Use Case

The primary Clouder use case is checkpointing Jupyter notebook sessions:

# 1. User is running a Jupyter notebook with loaded data
# Pod: jupyter-user-abc123, Container: notebook

# 2. User wants to stop working but preserve state
clouder criu checkpoint jupyter-user-abc123 \
--namespace jupyter \
--storage s3://clouder-checkpoints/ \
--label "end-of-day"

# 3. Later (hours, days), restore the session
clouder criu restore <checkpoint-id> \
--namespace jupyter \
--node gpu-worker-1 # Optionally on a different node

# 4. User reconnects to the notebook — all variables, imports,
# and computation state are exactly as they left them

Benefits for Jupyter Users

  • No re-execution: No need to re-run cells to rebuild state
  • Hardware flexibility: Checkpoint on CPU, restore on GPU (for inference)
  • Cost savings: Release expensive GPU VMs when not actively computing
  • Session persistence: Survive node failures and cluster maintenance

Clouder Operator Integration

The Clouder Kubernetes operator can automate checkpointing:

# Example CRD for automated checkpointing
apiVersion: clouder.datalayer.io/v1
kind: CheckpointPolicy
metadata:
  name: jupyter-auto-checkpoint
spec:
  selector:
    matchLabels:
      app: jupyter
  schedule: "*/30 * * * *"  # Every 30 minutes
  storage:
    type: s3
    bucket: clouder-checkpoints
    prefix: auto/
  retention:
    maxCheckpoints: 5
    maxAge: 7d
  triggers:
    - type: idle  # Checkpoint when pod is idle
      idleTimeout: 15m
    - type: preemption  # Checkpoint before spot eviction
CLI Reference

# Checkpoint operations
clouder criu checkpoint <pod> [--namespace ns] [--storage path] [--label label]
clouder criu ls [--pod name] [--label label] [--storage path]
clouder criu inspect <checkpoint-id>
clouder criu delete <checkpoint-id>

# Restore operations
clouder criu restore <checkpoint-id> [--node node] [--namespace ns] [--name new-name]

# Configuration
clouder criu configure [--storage default-path]
clouder criu status # Show CRIU status on all nodes

Roadmap

  • Phase 1: Basic checkpoint/restore via kubelet API (single node)
  • Phase 2: Cross-node restore with S3 checkpoint storage
  • Phase 3: Automated checkpoint policies via CRD/operator
  • Phase 4: Jupyter-specific integration (pre/post hooks for clean checkpoint)
  • Phase 5: GPU state preservation (CUDA checkpoint research)
  • Phase 6: Incremental checkpoints (only changed memory pages)