Datashim

Datashim must be deployed in the cloud for the Jupyter Content features to work. Start by adding the Datashim Helm repository.

helm repo add datashim https://datashim-io.github.io/datashim
helm repo update

Install Datashim.

plane up datalayer-datashim

Check the Datashim Pods.

kubectl get pods -n datalayer-datashim
# NAME                                READY   STATUS    RESTARTS   AGE
# csi-attacher-s3-0                   1/1     Running   0          8s
# csi-provisioner-s3-0                1/1     Running   0          8s
# csi-s3-2rllf                        2/2     Running   0          8s
# ...
# csi-s3-bkbkr                        2/2     Running   0          8s
# csi-s3-c4xv5                        2/2     Running   0          8s
# dataset-operator-7b55b587d4-xtd6q   1/1     Running   0          2m25s
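
Instead of re-running the command until every Pod reports Running, the check can be made blocking with `kubectl wait`. The following is a minimal sketch, assuming the `datalayer-datashim` namespace shown above; the helper name `wait_for_datashim` and the 120-second default timeout are illustrative choices, not part of Datashim or `plane`.

```shell
# Hypothetical helper: block until every Pod in the namespace is Ready.
# Usage: wait_for_datashim [namespace] [timeout]
wait_for_datashim() {
  # kubectl wait exits non-zero if any Pod is not Ready before the timeout.
  kubectl wait --for=condition=Ready pods --all \
    -n "${1:-datalayer-datashim}" \
    --timeout="${2:-120s}"
}
```

Calling `wait_for_datashim` with no arguments uses the defaults; pass a namespace and a timeout to override them.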

S3 Secret

Create the secret for the S3 access so it can be reused in the Jupyter Environments.

kubectl create secret generic \
  s3-secret \
  --from-literal=access_key_id="$AWS_ACCESS_KEY_ID" \
  --from-literal=secret_access_key="$AWS_SECRET_ACCESS_KEY" \
  --from-literal=region="$AWS_DEFAULT_REGION" \
  -n datalayer-runtimes
kubectl describe secret s3-secret -n datalayer-runtimes
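
An unset shell variable expands to an empty string, so the secret may end up holding empty values without any visible error. A small pre-flight check, sketched below, can catch that before the secret is created. The helper name `check_aws_env` is hypothetical; the three variable names are the ones used in the command above.

```shell
# Hypothetical pre-flight check: fail if a credential variable is unset or empty.
check_aws_env() {
  missing=0
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  # Returns 0 when all variables are set, 1 otherwise.
  return "$missing"
}
```

One possible usage is to chain it in front of the creation command, for example `check_aws_env && kubectl create secret generic ...`, so that nothing is created when a variable is missing.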

Namespace Label

Datashim uses a mutating webhook with a namespaceSelector that requires the label monitor-pods-datasets=enabled on any namespace where pods should receive automatic dataset volume mounts.

The plane up datalayer-datashim command automatically labels the datalayer-runtimes namespace. For other namespaces (e.g. default for testing), add the label manually:

kubectl label namespace default monitor-pods-datasets=enabled
Caution: Without this label, the Datashim webhook will not inject volume mounts into Pods, even if the Pods have the correct dataset.0.id labels and the Dataset and PVC exist.
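
To confirm that a namespace actually carries the label, the list of labeled namespaces can be queried with a label selector. This is a minimal sketch; the helper name `namespace_is_monitored` is hypothetical, while the label key and value are the ones described above.

```shell
# Hypothetical helper: succeed only if the given namespace carries the webhook label.
# Usage: namespace_is_monitored <namespace>
namespace_is_monitored() {
  # -l filters namespaces by label; -o name prints entries as "namespace/<name>".
  kubectl get namespaces -l monitor-pods-datasets=enabled -o name \
    | grep -Fx "namespace/$1" >/dev/null
}
```

For example, `namespace_is_monitored default || echo "default is not labeled"` prints a warning when the label is missing.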

Validation

Validate the configuration by creating an example Dataset and mounting it in a Pod.

Step 1: Create a Dataset

cat <<EOF | kubectl apply -f -
apiVersion: datashim.io/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
spec:
  local:
    type: COS
    accessKeyID: $AWS_ACCESS_KEY_ID
    secretAccessKey: $AWS_SECRET_ACCESS_KEY
    endpoint: https://s3.$AWS_DEFAULT_REGION.amazonaws.com
    bucket: datalayer-dev
    region: $AWS_DEFAULT_REGION
EOF

Verify the Dataset and PVC are created:

kubectl describe dataset example-dataset
kubectl get pvc example-dataset
# NAME              STATUS   VOLUME             CAPACITY       ACCESS MODES   STORAGECLASS
# example-dataset   Bound    pvc-c26adf05-...   9765625000Ki   RWX            csi-s3
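
The PVC may take a moment to reach the Bound status after the Dataset is applied. The sketch below polls the PVC phase until it is Bound or the attempts run out. The helper name `wait_for_pvc_bound` and its default of 30 attempts with a 2-second delay are assumptions made for illustration.

```shell
# Hypothetical helper: poll until the PVC reports phase "Bound" or attempts run out.
# Usage: wait_for_pvc_bound <pvc-name> [attempts] [delay-seconds]
wait_for_pvc_bound() {
  pvc="$1"
  attempts="${2:-30}"
  delay="${3:-2}"
  while [ "$attempts" -gt 0 ]; do
    # Read only the phase field; treat a failed lookup as "not bound yet".
    phase="$(kubectl get pvc "$pvc" -o jsonpath='{.status.phase}' 2>/dev/null)" || true
    [ "$phase" = "Bound" ] && return 0
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  echo "PVC $pvc is not Bound (last phase: ${phase:-unknown})" >&2
  return 1
}
```

For the example above, `wait_for_pvc_bound example-dataset` would return once the claim is bound, or fail after roughly a minute.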

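Step 2: Label the namespace

The Dataset and the Pod in this example are created without an explicit namespace, so they land in default unless your kubectl context says otherwise. Label that namespace for the webhook (skip this if you already did so in the Namespace Label section):
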
kubectl label namespace default monitor-pods-datasets=enabled

Step 3: Mount the Dataset in a Pod

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    dataset.0.id: example-dataset
    dataset.0.useas: mount
spec:
  containers:
    - name: nginx
      image: nginx
EOF

Verify the mount is injected and accessible:

kubectl get pod nginx
kubectl exec nginx -it -- ls /mnt/datasets/example-dataset
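
If the directory listing fails, it helps to check whether the webhook injected the mount into the Pod spec at all. The sketch below inspects the container volume mounts for the `/mnt/datasets/<dataset-name>` path used above. The helper name `pod_mounts_dataset` is hypothetical.

```shell
# Hypothetical helper: succeed if any container in the Pod mounts the dataset path.
# Usage: pod_mounts_dataset <pod-name> <dataset-name>
pod_mounts_dataset() {
  # jsonpath prints all mount paths separated by spaces; split them one per line.
  kubectl get pod "$1" \
    -o jsonpath='{.spec.containers[*].volumeMounts[*].mountPath}' \
    | tr ' ' '\n' \
    | grep -Fx "/mnt/datasets/$2" >/dev/null
}
```

A failing `pod_mounts_dataset nginx example-dataset` suggests the webhook did not act on the Pod, in which case the namespace label is the first thing to check.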

Step 4: Clean up

kubectl delete pod nginx
kubectl delete dataset example-dataset

Tear Down

To tear down the Datashim deployment when it is no longer needed:

plane down datalayer-datashim