Datalayer Ray Addon

Deploy Datalayer Ray Addon

Plane:

export DATALAYER_RUN_URL=https://r-eastus.datalayer.run
plane up datalayer-ray

Helm:

export RELEASE=datalayer-ray
export NAMESPACE=datalayer-api
...

Terraform:

cd terraform
...

To list what is deployed, with Plane:

plane ls

With Helm:

helm ls -A

Overview

This page describes the productized Ray addon architecture for Datalayer services.

The design follows KubeRay recommendations:

KubeRay Operator manages RayCluster, RayJob, and RayService CRDs.
A Datalayer addon API provides a stable REST surface for Datalayer CLI and SDK.
A dedicated datalayer-ray chart deploys the addon API and can install KubeRay.

For Ray and KubeRay references, see:

Goals

Expose a simple REST API to create/list/delete Ray clusters.
Expose a simple REST API to submit and monitor Ray jobs.
Integrate this API into the datalayer ray ... CLI commands and the SDK primitives.
Package deployables for production usage via Plane and Helm.

Current Architecture

REST API Contract

Base path: /api/ray/v1

All resource endpoints are authenticated and require the platform_member role.

Health and Version

GET /healthz
GET /api/ray/v1/version

RayCluster Management

GET /api/ray/v1/clusters?namespace=<ns>
POST /api/ray/v1/clusters
GET /api/ray/v1/clusters/{name}?namespace=<ns>
DELETE /api/ray/v1/clusters/{name}?namespace=<ns>

RayJob Management

POST /api/ray/v1/clusters/{cluster_name}/jobs
GET /api/ray/v1/jobs?namespace=<ns>&cluster_name=<optional>
GET /api/ray/v1/jobs/{name}?namespace=<ns>
DELETE /api/ray/v1/jobs/{name}?namespace=<ns>
GET /api/ray/v1/jobs/{name}/logs?namespace=<ns>&pod_name=<optional>&container=<optional>&tail_lines=200
GET /api/ray/v1/jobs/{name}/events?namespace=<ns>&limit=100
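
To illustrate how the path and query parameters in the contract combine, here is a minimal Python sketch that builds the job logs URL. The helper name job_logs_url is hypothetical and not part of the Datalayer SDK; only the path, the parameter names, and the tail_lines value of 200 come from the endpoint list above.

```python
from urllib.parse import quote, urlencode

BASE_PATH = "/api/ray/v1"


def job_logs_url(run_url, job_name, namespace, pod_name=None, container=None, tail_lines=200):
    """Build the GET URL for a job's logs (hypothetical helper, not an SDK call)."""
    # namespace and tail_lines are always sent; the optional filters only when set.
    params = {"namespace": namespace, "tail_lines": tail_lines}
    if pod_name:
        params["pod_name"] = pod_name
    if container:
        params["container"] = container
    # Percent-encode the job name so it is safe as a single path segment.
    path = f"{BASE_PATH}/jobs/{quote(job_name, safe='')}/logs"
    return f"{run_url.rstrip('/')}{path}?{urlencode(params)}"


print(job_logs_url("https://r-eastus.datalayer.run", "hello-ray-1", "default"))
```

The same pattern applies to the events endpoint, with limit in place of tail_lines.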

Core CLI and SDK

Python Core now exposes:

New URL config: DATALAYER_RUN_URL

Default: https://prod1.datalayer.run

SDK primitives through RayMixin:

ray_list_clusters, ray_create_cluster, ray_get_cluster, ray_delete_cluster
ray_submit_job, ray_list_jobs, ray_get_job, ray_delete_job
ray_get_job_logs, ray_get_job_events

CLI command group:

datalayer ray clusters ls|list|get|create|delete
datalayer ray jobs ls|list|submit|status|monitor|delete|logs|events

Point to a Non-Default Ray URL

Use any of the following when your Ray addon is not exposed on https://prod1.datalayer.run.

Per command override:

datalayer ray jobs ls --ray-url https://ray.my-company.net --namespace default

Global override for one CLI invocation:

datalayer --ray-url https://ray.my-company.net ray clusters ls --namespace default

Environment variable override (current shell):

export DATALAYER_RUN_URL=https://ray.my-company.net

To keep this override across terminal sessions on Linux/macOS (bash shown; for zsh, the default shell on recent macOS, use ~/.zshrc instead):

echo 'export DATALAYER_RUN_URL=https://ray.my-company.net' >> ~/.bashrc
source ~/.bashrc
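
The override mechanisms can be pictured as a precedence chain. The sketch below assumes that an explicit --ray-url value wins over DATALAYER_RUN_URL, which in turn wins over the documented default. That ordering is an assumption made for illustration, not confirmed CLI behavior, and resolve_ray_url is a hypothetical helper.

```python
import os

DEFAULT_RUN_URL = "https://prod1.datalayer.run"  # default stated in this page


def resolve_ray_url(cli_ray_url=None, environ=None):
    """Pick the Ray addon URL: flag, then environment variable, then default.

    The precedence order is an assumption, not documented CLI behavior.
    """
    env = os.environ if environ is None else environ
    if cli_ray_url:
        return cli_ray_url.rstrip("/")
    # Treat an empty environment variable the same as an unset one.
    return (env.get("DATALAYER_RUN_URL") or DEFAULT_RUN_URL).rstrip("/")


print(resolve_ray_url("https://ray.my-company.net"))
```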

Deployment

Helm Chart

Behavior:

Deploy datalayer-ray API service.
Install kuberay-operator as a chart dependency (enabled by default).
Optionally bootstrap a default RayCluster via bootstrapRayCluster.enabled=true.

Autoscaling

Ray worker autoscaling is supported through KubeRay and enabled by default for bootstrap and API-created clusters.

When enabled, the RayCluster spec includes:

spec.enableInTreeAutoscaling: true
Worker bounds from minReplicas and maxReplicas

Relevant defaults in datalayer-ray chart values:

bootstrapRayCluster:
  autoscaling:
    enabled: true
  worker:
    replicas: 1
    minReplicas: 1
    maxReplicas: 3

Notes:

Autoscaling acts within each worker group's minReplicas/maxReplicas bounds.
Existing RayClusters need an update/recreate to pick up newly enabled autoscaling fields.
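
As a rough picture of how the chart values could map onto the RayCluster spec, the following Python sketch builds the autoscaling-related fragment only. The function names and the group name "workers" are hypothetical, the bounds check is an assumption about sensible values rather than documented chart behavior, and a real worker group also needs a pod template that is omitted here.

```python
def worker_group_spec(replicas=1, min_replicas=1, max_replicas=3, group_name="workers"):
    """Return the replica-related fields of one worker group (fragment only)."""
    # Assumed sanity rule: the starting replica count sits inside the bounds.
    if not 0 <= min_replicas <= replicas <= max_replicas:
        raise ValueError("expected 0 <= minReplicas <= replicas <= maxReplicas")
    return {
        "groupName": group_name,  # hypothetical name
        "replicas": replicas,
        "minReplicas": min_replicas,
        "maxReplicas": max_replicas,
    }


def raycluster_spec_fragment(autoscaling_enabled=True, **worker):
    """Combine the autoscaling flag with a single worker group."""
    spec = {"workerGroupSpecs": [worker_group_spec(**worker)]}
    if autoscaling_enabled:
        spec["enableInTreeAutoscaling"] = True
    return spec


print(raycluster_spec_fragment())
```

With the defaults shown in the chart values (1, 1, 3), the fragment carries enableInTreeAutoscaling: true and a worker group bounded between 1 and 3 replicas.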

The ingress host used by plane up datalayer-ray is resolved from DATALAYER_RUN_URL.

Usage

Cluster lifecycle

datalayer ray clusters create my-ray --namespace default --worker-replicas 2
datalayer ray clusters list --namespace default
datalayer ray clusters get my-ray --namespace default

Submit and check a job

datalayer ray jobs submit my-ray --namespace default --entrypoint "python /workspace/train.py"
datalayer ray jobs list --namespace default --cluster-name my-ray
datalayer ray jobs status my-ray-job-<timestamp> --namespace default
datalayer ray jobs logs my-ray-job-<timestamp> --namespace default --tail-lines 300
datalayer ray jobs events my-ray-job-<timestamp> --namespace default --limit 100
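
When scripting around job status instead of using datalayer ray jobs monitor, a polling loop along these lines may help. This is a minimal sketch: wait_for_job is hypothetical, the status source is injected as a callable because the shape of the addon's response is not specified here, and the terminal state names follow Ray's job statuses (SUCCEEDED, FAILED, STOPPED). Whether the addon reports exactly those strings is an assumption.

```python
import time

# Terminal job statuses as defined by Ray; assumed to be what the addon reports.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(get_status, poll_seconds=5.0, timeout_seconds=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns a terminal state or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_seconds}s")
        sleep(poll_seconds)


# Usage with a canned sequence of statuses instead of a real API call.
statuses = iter(["PENDING", "RUNNING", "SUCCEEDED"])
print(wait_for_job(lambda: next(statuses), sleep=lambda _: None))
```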

Hands-On Examples

This section is fully copy/paste friendly and does not depend on repository file paths.

Prerequisites

export DATALAYER_RUN_URL=https://r-eastus.datalayer.run
# Set your token if not already stored by datalayer login
export DATALAYER_API_KEY=<your-token>

1) Simple health and connectivity

# Verify the Ray addon API is healthy.
curl -sS "${DATALAYER_RUN_URL}/api/ray/v1/version" \
  -H "Authorization: Bearer ${DATALAYER_API_KEY}" | jq

# List clusters through the Datalayer CLI.
datalayer ray clusters ls --namespace default
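
For scripts that cannot shell out to curl, the same version check can be sketched with the Python standard library. The helper names are hypothetical, and the assumption that the endpoint returns JSON is inferred from the jq pipe in the curl example.

```python
import json
import urllib.request


def build_version_request(run_url, token):
    """Build the GET request for the version endpoint with a bearer token."""
    url = f"{run_url.rstrip('/')}/api/ray/v1/version"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_version(run_url, token, timeout=10):
    """Send the request and decode the body, assumed to be JSON."""
    with urllib.request.urlopen(build_version_request(run_url, token), timeout=timeout) as resp:
        return json.load(resp)


# Example call (needs network access and a valid token):
# fetch_version(os.environ["DATALAYER_RUN_URL"], os.environ["DATALAYER_API_KEY"])
```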

2) Create a cluster and scale up/down

# Create a Ray cluster.
datalayer ray clusters create my-ray --namespace default --worker-replicas 1

# Verify state.
datalayer ray clusters get my-ray --namespace default

# Scale up workers to 3.
kubectl -n default patch raycluster my-ray --type='json' -p='[
  {"op":"replace","path":"/spec/workerGroupSpecs/0/replicas","value":3},
  {"op":"replace","path":"/spec/workerGroupSpecs/0/minReplicas","value":1},
  {"op":"replace","path":"/spec/workerGroupSpecs/0/maxReplicas","value":6}
]'

# Confirm scale-up.
datalayer ray clusters get my-ray --namespace default

# Scale down workers to 0.
kubectl -n default patch raycluster my-ray --type='json' -p='[
  {"op":"replace","path":"/spec/workerGroupSpecs/0/replicas","value":0},
  {"op":"replace","path":"/spec/workerGroupSpecs/0/minReplicas","value":0},
  {"op":"replace","path":"/spec/workerGroupSpecs/0/maxReplicas","value":6}
]'

# Confirm scale-down.
datalayer ray clusters get my-ray --namespace default
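
The two kubectl patches differ only in their numbers, so the JSON Patch document can be generated rather than hand-edited. The sketch below is one way to do that in Python. scale_patch is a hypothetical helper, and the bounds check is an added safety assumption rather than something kubectl enforces.

```python
import json


def scale_patch(replicas, min_replicas, max_replicas, group_index=0):
    """Build the JSON Patch operations for one worker group of a RayCluster."""
    if not 0 <= min_replicas <= replicas <= max_replicas:
        raise ValueError("expected 0 <= minReplicas <= replicas <= maxReplicas")
    base = f"/spec/workerGroupSpecs/{group_index}"
    return [
        {"op": "replace", "path": f"{base}/replicas", "value": replicas},
        {"op": "replace", "path": f"{base}/minReplicas", "value": min_replicas},
        {"op": "replace", "path": f"{base}/maxReplicas", "value": max_replicas},
    ]


# Scale up to 3 workers; pass the printed JSON to kubectl patch with --type=json -p.
print(json.dumps(scale_patch(3, 1, 6)))
```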

3) Run a real Python file

No intermediary file creation or copy is required. --py supports multiline source via stdin (@-). Use unique job names when re-running examples to avoid 409 AlreadyExists.

Example A: Hello Ray

JOB_NAME="hello-ray-$(date +%s)"

datalayer ray jobs submit my-ray --namespace default --job-name "${JOB_NAME}" --py @- <<'PY'
import ray

ray.init(address="auto")

def square(x: int) -> int:
  return x * x

remote_square = ray.remote(square)
nums = list(range(10))
print("Output:", ray.get([remote_square.remote(n) for n in nums]))
PY

datalayer ray jobs monitor "${JOB_NAME}" --namespace default
datalayer ray jobs logs "${JOB_NAME}" --namespace default --tail-lines 200

Example B: Monte Carlo Pi

JOB_NAME="pi-monte-carlo-$(date +%s)"

datalayer ray jobs submit my-ray --namespace default --job-name "${JOB_NAME}" --py @- <<'PY'
import random
import ray

ray.init(address="auto")

workers = 16
samples_per_worker = 100_000

def count_inside(num_samples: int) -> int:
  inside = 0
  for _ in range(num_samples):
    x = random.random()
    y = random.random()
    if x * x + y * y <= 1.0:
      inside += 1
  return inside

remote_count_inside = ray.remote(count_inside)
inside_total = sum(ray.get([remote_count_inside.remote(samples_per_worker) for _ in range(workers)]))
total_samples = workers * samples_per_worker
pi_estimate = 4.0 * inside_total / total_samples

print("Workers:", workers)
print("Samples per worker:", samples_per_worker)
print("Total samples:", total_samples)
print("Estimated pi:", pi_estimate)
PY

datalayer ray jobs monitor "${JOB_NAME}" --namespace default
datalayer ray jobs logs "${JOB_NAME}" --namespace default --tail-lines 200
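
To judge whether the printed estimate is plausible, it helps to know how much a Monte Carlo estimate of this kind typically varies. Each sample lands inside the quarter circle with probability pi/4, so the estimator 4 * inside / total has standard error 4 * sqrt(p * (1 - p) / N) with p = pi/4. The short sketch below evaluates that formula for the sample count used in Example B; pi_standard_error is a hypothetical helper name.

```python
import math


def pi_standard_error(total_samples):
    """Standard error of the estimator 4 * inside / total for uniform samples."""
    p = math.pi / 4.0  # probability that a point falls inside the quarter circle
    return 4.0 * math.sqrt(p * (1.0 - p) / total_samples)


total = 16 * 100_000  # workers * samples_per_worker from Example B
print(f"Standard error for {total} samples: {pi_standard_error(total):.4f}")
```

For 1,600,000 samples this comes to about 0.0013, so an estimate within a few thousandths of 3.1416 is what the job should usually print.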

Example C: Stateful Actor Counter

JOB_NAME="actor-counter-$(date +%s)"

datalayer ray jobs submit my-ray --namespace default --job-name "${JOB_NAME}" --py @- <<'PY'
import ray

ray.init(address="auto")

class Counter:
  def __init__(self):
    self.value = 0

  def add(self, amount: int) -> int:
    self.value += amount
    return self.value

  def get(self) -> int:
    return self.value

RemoteCounter = ray.remote(Counter)
counters = [RemoteCounter.remote() for _ in range(4)]

for step in [1, 2, 3, 4, 5]:
  ray.get([c.add.remote(step) for c in counters])

totals = ray.get([c.get.remote() for c in counters])
print("Per-actor totals:", totals)
print("Grand total:", sum(totals))
PY

datalayer ray jobs monitor "${JOB_NAME}" --namespace default
datalayer ray jobs logs "${JOB_NAME}" --namespace default --tail-lines 200
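
As a local cross-check that needs no cluster, the same counter logic can be run as plain Python objects. This is not how Ray executes actors (there is no remote scheduling here); it only reproduces the arithmetic so the job logs can be compared against known values.

```python
class Counter:
    """Plain-Python stand-in for the actor in Example C."""

    def __init__(self):
        self.value = 0

    def add(self, amount: int) -> int:
        self.value += amount
        return self.value

    def get(self) -> int:
        return self.value


counters = [Counter() for _ in range(4)]
for step in [1, 2, 3, 4, 5]:
    for c in counters:
        c.add(step)

totals = [c.get() for c in counters]
print("Per-actor totals:", totals)   # [15, 15, 15, 15]
print("Grand total:", sum(totals))   # 60
```

Each counter receives 1 + 2 + 3 + 4 + 5 = 15, so the Ray job should report four totals of 15 and a grand total of 60.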

Security and RBAC

The chart creates a ClusterRole and ClusterRoleBinding allowing:

rayclusters and rayjobs CRUD across namespaces.
pods read/list/watch for job pod discovery.
pods/log read for job logs.
events read/list/watch for job events.
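
The permissions listed above can be modeled as Kubernetes RBAC rules. The sketch below expresses them as Python dictionaries with a small checker. The exact verb lists are an assumption (CRUD is read here as get, list, watch, create, update, patch, delete), and the chart's actual ClusterRole may differ.

```python
CRUD_VERBS = ["get", "list", "watch", "create", "update", "patch", "delete"]  # assumed
READ_VERBS = ["get", "list", "watch"]


def cluster_role_rules():
    """Approximate the ClusterRole rules described in this section."""
    return [
        {"apiGroups": ["ray.io"], "resources": ["rayclusters", "rayjobs"], "verbs": CRUD_VERBS},
        {"apiGroups": [""], "resources": ["pods", "events"], "verbs": READ_VERBS},
        {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
    ]


def allows(rules, api_group, resource, verb):
    """Return True if any rule grants the verb on the resource in that API group."""
    return any(
        api_group in rule["apiGroups"]
        and resource in rule["resources"]
        and verb in rule["verbs"]
        for rule in rules
    )


rules = cluster_role_rules()
print(allows(rules, "ray.io", "rayjobs", "delete"))  # True
print(allows(rules, "", "pods", "delete"))           # False
```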

For multi-tenant clusters, prefer one namespace per team/project and deploy one addon instance per namespace.

Implementation Phases

Phase 1 (done): API contract + core CLI/SDK + deploy artifacts.
Phase 2 (next): Add stricter authn/authz checks and audit metadata.
Phase 3 (next): Add richer job logs/events endpoints.
Phase 4 (next): Add RayService API support for long-running serving workloads.

Open Questions To Confirm

Do we want organization/team scoped authz in this addon from day 1, or after internal validation?
Should we add RayService endpoints now or keep them for phase 4?
Do we standardize one namespace per account/team for all Ray resources?

Delete the Cluster

datalayer ray jobs ls --namespace default --cluster-name my-ray
datalayer ray clusters delete my-ray --namespace default
