Datashim

Datashim must be deployed in the cloud to enable the Jupyter Content features. Start by adding the Datashim Helm repository.

helm repo add datashim https://datashim-io.github.io/datashim
helm repo update

Install Datashim.

plane up datalayer-datashim

Check the Datashim Pods.

kubectl get pods -n datalayer-datashim
# NAME                                READY   STATUS              RESTARTS   AGE
# csi-attacher-s3-0                   1/1     Running             0          8s
# csi-provisioner-s3-0                1/1     Running             0          8s
# csi-s3-2rllf                        2/2     Running             0          8s
# ...
# csi-s3-bkbkr                        2/2     Running             0          8s
# csi-s3-c4xv5                        2/2     Running             0          8s
# dataset-operator-7b55b587d4-xtd6q   1/1     Running             0          2m25s
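
To spot pods that are not yet ready without reading the table by eye, the listing above can be filtered. The following is only a sketch and not part of Datashim: `not_ready_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: print the names of pods in a namespace that are not fully ready.
# Assumes the default `kubectl get pods` columns: NAME READY STATUS RESTARTS AGE.
not_ready_pods() {
  kubectl get pods -n "$1" --no-headers |
    awk '{
      split($2, ready, "/")   # READY looks like 2/2
      if (ready[1] != ready[2] || $3 != "Running") print $1
    }'
}

# Usage: an empty result means every pod is Running and fully ready.
#   not_ready_pods datalayer-datashim
```

Pods in a finished state such as Completed would also be listed by this filter, so it is best suited to long-running workloads like the ones shown above.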

S3 Secret

Create a secret holding the S3 credentials so that it can be reused by the Jupyter Environments.

kubectl create secret generic \
  s3-secret \
  --from-literal=access_key_id="$AWS_ACCESS_KEY_ID" \
  --from-literal=secret_access_key="$AWS_SECRET_ACCESS_KEY" \
  --from-literal=region="$AWS_DEFAULT_REGION" \
  -n datalayer-runtimes
kubectl describe secret s3-secret -n datalayer-runtimes
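
The secret is built from three environment variables, and an unset variable would silently produce an empty value. A minimal pre-flight check could look like the sketch below; `require_env` is a hypothetical helper introduced here for illustration, not a Datashim or kubectl feature.

```shell
# Hypothetical helper: fail early if any named variable is unset or empty.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name (POSIX sh compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: run before `kubectl create secret`.
#   require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION || exit 1
```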

Namespace Label

Datashim uses a mutating webhook with a namespaceSelector that requires the label monitor-pods-datasets=enabled on any namespace where pods should receive automatic dataset volume mounts.

The plane up datalayer-datashim command automatically labels the datalayer-runtimes namespace. For other namespaces (e.g. default for testing), add the label manually:

kubectl label namespace default monitor-pods-datasets=enabled

Caution: Without this label, the Datashim webhook will not inject volume mounts into pods, even if the pods carry the correct dataset.0.id labels and the Dataset and PVC exist.
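
Because a missing label fails silently, it can help to confirm that a namespace carries it before debugging anything else. The sketch below uses the standard `kubectl get namespaces -l ... -o name` label selector; `ns_has_dataset_label` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the namespace carries monitor-pods-datasets=enabled.
ns_has_dataset_label() {
  # `-o name` prints one entry per line in the form namespace/<name>.
  kubectl get namespaces -l monitor-pods-datasets=enabled -o name |
    grep -qxF "namespace/$1"
}

# Usage:
#   ns_has_dataset_label default || echo "label missing on default" >&2
```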

Validation

Validate the configuration by creating an example Dataset and mounting it in a Pod.

Step 1: Create a Dataset

cat <<EOF | kubectl apply -f -
apiVersion: datashim.io/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
spec:
  local:
    type: COS
    accessKeyID: $AWS_ACCESS_KEY_ID
    secretAccessKey: $AWS_SECRET_ACCESS_KEY
    endpoint: https://s3.$AWS_DEFAULT_REGION.amazonaws.com
    bucket: datalayer-dev
    region: $AWS_DEFAULT_REGION
EOF

Verify that the Dataset and its PVC are created:

kubectl describe dataset example-dataset
kubectl get pvc example-dataset
# NAME              STATUS   VOLUME                                     CAPACITY        ACCESS MODES   STORAGECLASS
# example-dataset   Bound    pvc-c26adf05-...                           9765625000Ki    RWX            csi-s3
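
The PVC may take a moment to reach the Bound status shown above. For scripted validation, a polling loop along the following lines could be used. This is a sketch only: `wait_for_pvc_bound` is a hypothetical helper, and the default of 30 attempts spaced 2 seconds apart is an arbitrary assumption.

```shell
# Hypothetical helper: poll until the PVC reports phase Bound or the attempts run out.
wait_for_pvc_bound() {
  pvc="$1"
  attempts="${2:-30}"   # number of polls, 2 seconds apart (assumed defaults)
  while [ "$attempts" -gt 0 ]; do
    phase="$(kubectl get pvc "$pvc" -o jsonpath='{.status.phase}' 2>/dev/null)"
    if [ "$phase" = "Bound" ]; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 2
  done
  echo "PVC $pvc is not Bound (last phase: ${phase:-unknown})" >&2
  return 1
}

# Usage:
#   wait_for_pvc_bound example-dataset
```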

Step 2: Label the namespace

kubectl label namespace default monitor-pods-datasets=enabled

Step 3: Mount the Dataset in a Pod

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    dataset.0.id: example-dataset
    dataset.0.useas: mount
spec:
  containers:
    - name: nginx
      image: nginx
EOF

Verify the mount is injected and accessible:

kubectl get pod nginx
kubectl exec nginx -it -- ls /mnt/datasets/example-dataset
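
If the directory listing fails, inspecting the pod spec shows whether the webhook injected anything at all. The sketch below assumes that the injected volume references a PVC named after the Dataset, as the PVC check in Step 1 suggests; `pod_uses_claim` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the pod has a volume backed by the given PVC.
# Assumption: the Datashim webhook injects a volume whose claimName equals the Dataset name.
pod_uses_claim() {
  # jsonpath prints the claim names of all PVC-backed volumes, separated by spaces.
  kubectl get pod "$1" -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}' |
    tr ' ' '\n' | grep -qxF "$2"
}

# Usage:
#   pod_uses_claim nginx example-dataset || echo "no dataset volume injected" >&2
```

A pod created before the namespace was labeled will not be mutated retroactively, since mutating webhooks only act at admission time; in that case delete and recreate the pod.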

Step 4: Clean up

kubectl delete pod nginx
kubectl delete dataset example-dataset

Tear Down

If needed, tear down the Datashim deployment.

plane down datalayer-datashim
