Datalayer Ray Addon
Deploy Datalayer Ray Addon
- Plane
  export DATALAYER_RUN_URL=https://r-eastus.datalayer.run
  plane up datalayer-ray
- Helm
  export RELEASE=datalayer-ray
  export NAMESPACE=datalayer-api
  ...
- Terraform
  cd terraform
  ...
Then list what is deployed:
- Plane
  plane ls
- Helm
  helm ls -A
Overview
This page describes the productized Ray addon architecture for Datalayer services.
The design follows KubeRay recommendations:
- KubeRay Operator manages RayCluster, RayJob, and RayService CRDs.
- A Datalayer addon API provides a stable REST surface for Datalayer CLI and SDK.
- A dedicated datalayer-ray chart deploys the addon API and can install KubeRay.
For Ray and KubeRay references, see:
Goals
- Expose a simple REST API to create/list/delete Ray clusters.
- Expose a simple REST API to submit and monitor Ray jobs.
- Integrate this API in datalayer ray ... CLI and SDK primitives.
- Package deployables for production usage via Plane and Helm.
Current Architecture
REST API Contract
Base path: /api/ray/v1
All resource endpoints are authenticated and require the platform_member role.
Health and Version
- GET /healthz
- GET /api/ray/v1/version
RayCluster Management
- GET /api/ray/v1/clusters?namespace=<ns>
- POST /api/ray/v1/clusters
- GET /api/ray/v1/clusters/{name}?namespace=<ns>
- DELETE /api/ray/v1/clusters/{name}?namespace=<ns>
RayJob Management
- POST /api/ray/v1/clusters/{cluster_name}/jobs
- GET /api/ray/v1/jobs?namespace=<ns>&cluster_name=<optional>
- GET /api/ray/v1/jobs/{name}?namespace=<ns>
- DELETE /api/ray/v1/jobs/{name}?namespace=<ns>
- GET /api/ray/v1/jobs/{name}/logs?namespace=<ns>&pod_name=<optional>&container=<optional>&tail_lines=200
- GET /api/ray/v1/jobs/{name}/events?namespace=<ns>&limit=100
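All of the endpoints above share the /api/ray/v1 base path and take their scoping as query parameters. The following is a minimal sketch of how a client might assemble these URLs with the Python standard library only. The helper name ray_api_url is hypothetical and is not part of the Datalayer SDK; optional parameters are assumed to be omitted when unset.

```python
from urllib.parse import quote, urlencode

BASE_PATH = "/api/ray/v1"  # base path stated in the REST API contract


def ray_api_url(run_url: str, *segments: str, **params) -> str:
    """Build a Ray addon URL (hypothetical helper, not part of the Datalayer SDK)."""
    # Percent-encode each path segment so names cannot break the path.
    path = "/".join(quote(segment, safe="") for segment in segments)
    url = f"{run_url.rstrip('/')}{BASE_PATH}/{path}"
    # Drop optional query parameters that were not provided.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{url}?{query}" if query else url


if __name__ == "__main__":
    base = "https://r-eastus.datalayer.run"
    print(ray_api_url(base, "clusters", namespace="default"))
    print(ray_api_url(base, "jobs", "my-job", "logs", namespace="default", tail_lines=200))
```

Requests to these URLs still need the Authorization header, because every resource endpoint is authenticated.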
Core CLI and SDK
Python Core now exposes:
- New URL config: DATALAYER_RUN_URL
  - Default: https://prod1.datalayer.run
- SDK primitives through RayMixin:
  - ray_list_clusters, ray_create_cluster, ray_get_cluster, ray_delete_cluster
  - ray_submit_job, ray_list_jobs, ray_get_job, ray_delete_job
  - ray_get_job_logs, ray_get_job_events
- CLI command group:
  - datalayer ray clusters ls|list|get|create|delete
  - datalayer ray jobs ls|list|submit|status|monitor|delete|logs|events
Point to a Non-Default Ray URL
Use any of the following when your Ray addon is not exposed on https://prod1.datalayer.run.
- Per command override:
datalayer ray jobs ls --ray-url https://ray.my-company.net --namespace default
- Global override for one CLI invocation:
datalayer --ray-url https://ray.my-company.net ray clusters ls --namespace default
- Environment variable override (current shell):
export DATALAYER_RUN_URL=https://ray.my-company.net
To keep this override across terminal sessions on Linux/macOS:
echo 'export DATALAYER_RUN_URL=https://ray.my-company.net' >> ~/.bashrc
source ~/.bashrc
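The three override mechanisms above imply a lookup order. The sketch below shows one plausible resolution: an explicit --ray-url value first, then DATALAYER_RUN_URL, then the documented default. This precedence is an assumption, not confirmed behavior of the Datalayer CLI, and resolve_ray_url is a hypothetical name.

```python
import os
from typing import Mapping, Optional

DEFAULT_RUN_URL = "https://prod1.datalayer.run"  # default stated in this page


def resolve_ray_url(cli_value: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the Ray addon URL (hypothetical resolver; precedence is assumed)."""
    env = os.environ if env is None else env
    # Assumed order: command-line flag, then environment variable, then default.
    candidate = cli_value or env.get("DATALAYER_RUN_URL") or DEFAULT_RUN_URL
    # Normalize so that paths can be appended without a double slash.
    return candidate.rstrip("/")


if __name__ == "__main__":
    print(resolve_ray_url())
    print(resolve_ray_url("https://ray.my-company.net"))
```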
Deployment
Helm Chart
Behavior:
- Deploy datalayer-ray API service.
- Install kuberay-operator as a chart dependency (enabled by default).
- Optionally bootstrap a default RayCluster via bootstrapRayCluster.enabled=true.
Autoscaling
Ray worker autoscaling is supported through KubeRay and enabled by default for bootstrap and API-created clusters.
When enabled, the RayCluster spec includes:
- spec.enableInTreeAutoscaling: true
- Worker bounds from minReplicas and maxReplicas
Relevant defaults in datalayer-ray chart values:
bootstrapRayCluster:
  autoscaling:
    enabled: true
  worker:
    replicas: 1
    minReplicas: 1
    maxReplicas: 3
Notes:
- Autoscaling acts within each worker group's minReplicas/maxReplicas bounds.
- Existing RayClusters need an update/recreate to pick up newly enabled autoscaling fields.
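To make the autoscaling fields concrete, the sketch below builds only the relevant fragment of a RayCluster spec and checks that replicas sits within the worker bounds. It is not a complete manifest (a real RayCluster also needs a head group and pod templates), and the function names and the group name "workers" are hypothetical.

```python
import json


def worker_group(name: str, replicas: int, min_replicas: int, max_replicas: int) -> dict:
    """Return the scaling fields of one worker group (illustrative fragment only)."""
    if not 0 <= min_replicas <= replicas <= max_replicas:
        raise ValueError("expected 0 <= minReplicas <= replicas <= maxReplicas")
    return {
        "groupName": name,
        "replicas": replicas,
        "minReplicas": min_replicas,
        "maxReplicas": max_replicas,
    }


def autoscaling_spec(groups: list) -> dict:
    """Return the spec fragment that turns on KubeRay in-tree autoscaling."""
    return {"enableInTreeAutoscaling": True, "workerGroupSpecs": groups}


if __name__ == "__main__":
    # Mirrors the chart defaults shown above: 1 replica, bounds 1..3.
    spec = autoscaling_spec([worker_group("workers", 1, 1, 3)])
    print(json.dumps({"spec": spec}, indent=2))
```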
Ingress host for plane up datalayer-ray is resolved from DATALAYER_RUN_URL.
Usage
Cluster lifecycle
datalayer ray clusters create my-ray --namespace default --worker-replicas 2
datalayer ray clusters list --namespace default
datalayer ray clusters get my-ray --namespace default
Submit and check a job
datalayer ray jobs submit my-ray --namespace default --entrypoint "python /workspace/train.py"
datalayer ray jobs list --namespace default --cluster-name my-ray
datalayer ray jobs status my-ray-job-<timestamp> --namespace default
datalayer ray jobs logs my-ray-job-<timestamp> --namespace default --tail-lines 300
datalayer ray jobs events my-ray-job-<timestamp> --namespace default --limit 100
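When scripting the same flow instead of running the commands by hand, a polling loop is the usual way to wait for completion. The sketch below assumes the job status is reported as one of Ray's job status strings (PENDING, RUNNING, SUCCEEDED, FAILED, STOPPED); how the addon actually shapes its response is not specified here, so the status lookup is passed in as a callable. wait_for_job is a hypothetical helper, not the implementation of datalayer ray jobs monitor.

```python
import time
from typing import Callable

# Ray job statuses after which no further change is expected.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(fetch_status: Callable[[], str],
                 interval: float = 5.0,
                 timeout: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll fetch_status until a terminal state or timeout (hypothetical helper)."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout}s")
        sleep(interval)


if __name__ == "__main__":
    # Simulated status sequence standing in for real API calls.
    fake = iter(["PENDING", "RUNNING", "SUCCEEDED"])
    print(wait_for_job(lambda: next(fake), interval=0.0))
```

In real use, fetch_status would wrap a call to the job status endpoint or the ray_get_job SDK primitive and extract the status field from its response.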
Hands-On Examples
This section is fully copy/paste friendly and does not depend on repository file paths.
Prerequisites
export DATALAYER_RUN_URL=https://r-eastus.datalayer.run
# Set your token if not already stored by datalayer login
export DATALAYER_API_KEY=<your-token>
1) Simple health and connectivity
# Verify the Ray addon API is healthy.
curl -sS "${DATALAYER_RUN_URL}/api/ray/v1/version" \
-H "Authorization: Bearer ${DATALAYER_API_KEY}" | jq
# List clusters through the Datalayer CLI.
datalayer ray clusters ls --namespace default
2) Create a cluster and scale up/down
# Create a Ray cluster.
datalayer ray clusters create my-ray --namespace default --worker-replicas 1
# Verify state.
datalayer ray clusters get my-ray --namespace default
# Scale up workers to 3.
kubectl -n default patch raycluster my-ray --type='json' -p='[
{"op":"replace","path":"/spec/workerGroupSpecs/0/replicas","value":3},
{"op":"replace","path":"/spec/workerGroupSpecs/0/minReplicas","value":1},
{"op":"replace","path":"/spec/workerGroupSpecs/0/maxReplicas","value":6}
]'
# Confirm scale-up.
datalayer ray clusters get my-ray --namespace default
# Scale down workers to 0.
kubectl -n default patch raycluster my-ray --type='json' -p='[
{"op":"replace","path":"/spec/workerGroupSpecs/0/replicas","value":0},
{"op":"replace","path":"/spec/workerGroupSpecs/0/minReplicas","value":0},
{"op":"replace","path":"/spec/workerGroupSpecs/0/maxReplicas","value":6}
]'
# Confirm scale-down.
datalayer ray clusters get my-ray --namespace default
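The two kubectl patches above differ only in their numbers, so the JSON Patch document can be generated rather than hand-edited. This is a small sketch with a hypothetical scale_patch helper; it emits the same three replace operations and rejects bounds that do not satisfy minReplicas <= replicas <= maxReplicas.

```python
import json


def scale_patch(replicas: int, min_replicas: int, max_replicas: int,
                group_index: int = 0) -> str:
    """Return a JSON Patch string for one worker group (hypothetical helper)."""
    if not 0 <= min_replicas <= replicas <= max_replicas:
        raise ValueError("expected 0 <= minReplicas <= replicas <= maxReplicas")
    base = f"/spec/workerGroupSpecs/{group_index}"
    operations = [
        {"op": "replace", "path": f"{base}/replicas", "value": replicas},
        {"op": "replace", "path": f"{base}/minReplicas", "value": min_replicas},
        {"op": "replace", "path": f"{base}/maxReplicas", "value": max_replicas},
    ]
    return json.dumps(operations)


if __name__ == "__main__":
    print(scale_patch(3, 1, 6))  # scale-up example from above
    print(scale_patch(0, 0, 6))  # scale-down example from above
```

The printed string can be passed as the -p argument of kubectl patch with --type='json'.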
3) Run a real Python file
No intermediary file creation or copy is required.
--py supports multiline source via stdin (@-).
Use unique job names when re-running examples to avoid 409 AlreadyExists.
Example A: Hello Ray
JOB_NAME="hello-ray-$(date +%s)"
datalayer ray jobs submit my-ray --namespace default --job-name "${JOB_NAME}" --py @- <<'PY'
import ray

# Connect to the Ray cluster the job runs on.
ray.init(address="auto")

def square(x: int) -> int:
    return x * x

# Wrap the plain function as a Ray remote task.
remote_square = ray.remote(square)
nums = list(range(10))
print("Output:", ray.get([remote_square.remote(n) for n in nums]))
PY
datalayer ray jobs monitor "${JOB_NAME}" --namespace default
datalayer ray jobs logs "${JOB_NAME}" --namespace default --tail-lines 200
Example B: Monte Carlo Pi
JOB_NAME="pi-monte-carlo-$(date +%s)"
datalayer ray jobs submit my-ray --namespace default --job-name "${JOB_NAME}" --py @- <<'PY'
import random

import ray

# Connect to the Ray cluster the job runs on.
ray.init(address="auto")

workers = 16
samples_per_worker = 100_000

def count_inside(num_samples: int) -> int:
    # Count random points in the unit square that fall inside the quarter circle.
    inside = 0
    for _ in range(num_samples):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside

remote_count_inside = ray.remote(count_inside)
# Launch the sampling tasks in parallel and sum their hit counts.
inside_total = sum(ray.get([remote_count_inside.remote(samples_per_worker) for _ in range(workers)]))
total_samples = workers * samples_per_worker
pi_estimate = 4.0 * inside_total / total_samples
print("Workers:", workers)
print("Samples per worker:", samples_per_worker)
print("Total samples:", total_samples)
print("Estimated pi:", pi_estimate)
PY
datalayer ray jobs monitor "${JOB_NAME}" --namespace default
datalayer ray jobs logs "${JOB_NAME}" --namespace default --tail-lines 200
Example C: Stateful Actor Counter
JOB_NAME="actor-counter-$(date +%s)"
datalayer ray jobs submit my-ray --namespace default --job-name "${JOB_NAME}" --py @- <<'PY'
import ray

# Connect to the Ray cluster the job runs on.
ray.init(address="auto")

class Counter:
    def __init__(self):
        self.value = 0

    def add(self, amount: int) -> int:
        self.value += amount
        return self.value

    def get(self) -> int:
        return self.value

# Wrap the class as a Ray actor; each instance keeps its own state.
RemoteCounter = ray.remote(Counter)
counters = [RemoteCounter.remote() for _ in range(4)]
for step in [1, 2, 3, 4, 5]:
    ray.get([c.add.remote(step) for c in counters])
totals = ray.get([c.get.remote() for c in counters])
print("Per-actor totals:", totals)
print("Grand total:", sum(totals))
PY
datalayer ray jobs monitor "${JOB_NAME}" --namespace default
datalayer ray jobs logs "${JOB_NAME}" --namespace default --tail-lines 200
Security and RBAC
The chart creates a ClusterRole and ClusterRoleBinding allowing:
- rayclusters and rayjobs CRUD across namespaces.
- pods read/list/watch for job pod discovery.
- pods/log read for job logs.
- events read/list/watch for job events.
For multi-tenant clusters, prefer one namespace per team/project and deploy one addon instance per namespace.
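As a rough illustration of the permissions listed above, the sketch below expresses them as Kubernetes RBAC rules and prints the resulting ClusterRole as JSON. The exact verbs shipped by the chart are not specified on this page: "CRUD" is expanded here to an assumed verb set, the role name is hypothetical, and ray.io is the API group used by the KubeRay CRDs.

```python
import json

# Assumed expansion of "CRUD" and "read/list/watch" into RBAC verbs.
CRUD_VERBS = ["get", "list", "watch", "create", "update", "patch", "delete"]
READ_VERBS = ["get", "list", "watch"]

RULES = [
    {"apiGroups": ["ray.io"], "resources": ["rayclusters", "rayjobs"], "verbs": CRUD_VERBS},
    {"apiGroups": [""], "resources": ["pods"], "verbs": READ_VERBS},
    {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
    {"apiGroups": [""], "resources": ["events"], "verbs": READ_VERBS},
]


def cluster_role(name: str) -> dict:
    """Return a ClusterRole manifest as a dict (illustrative, not the chart's template)."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": RULES,
    }


if __name__ == "__main__":
    # "datalayer-ray-example" is a placeholder name.
    print(json.dumps(cluster_role("datalayer-ray-example"), indent=2))
```

For the one-namespace-per-team layout, the same rules could be carried by a namespaced Role and RoleBinding instead, which narrows the grant to that namespace.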
Implementation Phases
- Phase 1 (done): API contract + core CLI/SDK + deploy artifacts.
- Phase 2 (next): Add stricter authn/authz checks and audit metadata.
- Phase 3 (next): Add richer job logs/events endpoints.
- Phase 4 (next): Add RayService API support for long-running serving workloads.
Open Questions To Confirm
- Do we want organization/team scoped authz in this addon from day 1, or after internal validation?
- Should we add RayService endpoints now or keep them for phase 4?
- Do we standardize one namespace per account/team for all Ray resources?
Delete the Cluster
datalayer ray jobs ls --namespace default --cluster-name my-ray
datalayer ray clusters delete my-ray --namespace default