Datashim
Deploy Datashim in the cloud to enable the Jupyter Content features.
Add the Datashim Helm repository.
helm repo add datashim https://datashim-io.github.io/datashim
helm repo update
Install Datashim.
plane up datalayer-datashim
Check the Datashim Pods.
kubectl get pods -n datalayer-datashim
# NAME                                READY   STATUS    RESTARTS   AGE
# csi-attacher-s3-0                   1/1     Running   0          8s
# csi-provisioner-s3-0                1/1     Running   0          8s
# csi-s3-2rllf                        2/2     Running   0          8s
# ...
# csi-s3-bkbkr                        2/2     Running   0          8s
# csi-s3-c4xv5                        2/2     Running   0          8s
# dataset-operator-7b55b587d4-xtd6q   1/1     Running   0          2m25s
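The readiness check above can also be scripted instead of re-running `kubectl get pods` by hand. The following is a minimal sketch, assuming your kubectl version provides `kubectl wait`; the function name `wait_for_datashim` and the 120-second default timeout are illustrative choices and are not part of Datashim or `plane`.

```shell
# Hypothetical helper: block until every Pod in the namespace is Ready,
# or return non-zero once the timeout expires.
# $1 = namespace (defaults to datalayer-datashim), $2 = timeout (defaults to 120s).
wait_for_datashim() {
  kubectl wait --for=condition=Ready pods --all \
    -n "${1:-datalayer-datashim}" --timeout="${2:-120s}"
}
```

Calling `wait_for_datashim` with no arguments waits on the `datalayer-datashim` namespace; `wait_for_datashim datalayer-datashim 300s` allows more time on a slow cluster.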
S3 Secret
Create a secret holding the S3 credentials so that it can be reused in the Jupyter Environments.
kubectl create secret generic \
  s3-secret \
  --from-literal=access_key_id="$AWS_ACCESS_KEY_ID" \
  --from-literal=secret_access_key="$AWS_SECRET_ACCESS_KEY" \
  --from-literal=region="$AWS_DEFAULT_REGION" \
  -n datalayer-runtimes
kubectl describe secret s3-secret -n datalayer-runtimes
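The secret command reads its values from environment variables, so an unset variable silently produces an empty entry in the secret. The following is a minimal sketch of a pre-flight check; the function name `check_s3_env` is hypothetical, and the variable names are the three used in the command above.

```shell
# Hypothetical pre-flight check: report every variable used by the
# `kubectl create secret` command that is unset or empty.
check_s3_env() {
  status=0
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION; do
    # Indirect lookup of the variable called $name (names are fixed above).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      status=1
    fi
  done
  # Do not leave a credential lying around in a helper variable.
  unset value
  return "$status"
}
```

Running `check_s3_env` before creating the secret returns non-zero and names each missing variable on standard error.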
## Namespace Label
Datashim uses a mutating webhook with a namespaceSelector that requires the label monitor-pods-datasets=enabled on any namespace where pods should receive automatic dataset volume mounts.
The plane up datalayer-datashim command automatically labels the datalayer-runtimes namespace. For other namespaces (e.g. default for testing), add the label manually:
kubectl label namespace default monitor-pods-datasets=enabled
Without this label, the Datashim webhook will not inject volume mounts into pods, even if the pods have the correct dataset.0.id labels and the Dataset/PVC exist.
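Because a missing label fails silently, it can help to confirm that a namespace is labeled before debugging pods. The following is a minimal sketch that uses only a label selector on `kubectl get namespaces`; the function name `namespace_is_monitored` is hypothetical.

```shell
# Hypothetical helper: succeed only if the namespace carries the label
# that the webhook's namespaceSelector matches on.
# $1 = namespace name
namespace_is_monitored() {
  kubectl get namespaces -l monitor-pods-datasets=enabled -o name \
    | grep -qx "namespace/$1"
}
```

For example, `namespace_is_monitored default || kubectl label namespace default monitor-pods-datasets=enabled` adds the label only when it is absent.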
Validation
Validate the configuration by creating an example Dataset.
Step 1: Create a Dataset
cat <<EOF | kubectl apply -f -
apiVersion: datashim.io/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
spec:
  local:
    type: COS
    accessKeyID: $AWS_ACCESS_KEY_ID
    secretAccessKey: $AWS_SECRET_ACCESS_KEY
    endpoint: https://s3.$AWS_DEFAULT_REGION.amazonaws.com
    bucket: datalayer-dev
    region: $AWS_DEFAULT_REGION
EOF
Verify the Dataset and PVC are created:
kubectl describe dataset example-dataset
kubectl get pvc example-dataset
# NAME              STATUS   VOLUME             CAPACITY       ACCESS MODES   STORAGECLASS
# example-dataset   Bound    pvc-c26adf05-...   9765625000Ki   RWX            csi-s3
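The `Bound` status shown above can also be checked from a script by reading the PVC phase. The following is a minimal sketch; the function name `pvc_is_bound` is hypothetical, and the namespace defaults to `default` on the assumption that the Dataset was created there, as in this example.

```shell
# Hypothetical helper: succeed only if the PVC backing the Dataset is Bound.
# $1 = PVC name, $2 = namespace (defaults to default)
pvc_is_bound() {
  phase="$(kubectl get pvc "$1" -n "${2:-default}" -o jsonpath='{.status.phase}')"
  [ "$phase" = "Bound" ]
}
```

`pvc_is_bound example-dataset` returns zero once the claim is bound and non-zero while it is still pending or missing.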
Step 2: Label the namespace
kubectl label namespace default monitor-pods-datasets=enabled
Step 3: Mount the Dataset in a Pod
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    dataset.0.id: example-dataset
    dataset.0.useas: mount
spec:
  containers:
    - name: nginx
      image: nginx
EOF
Verify the mount is injected and accessible:
kubectl get pod nginx
kubectl exec nginx -it -- ls /mnt/datasets/example-dataset
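If the `ls` command fails, it is useful to know whether the webhook modified the Pod at all. The following is a minimal sketch that inspects the Pod spec, assuming an injected mount shows up as a volume whose `persistentVolumeClaim.claimName` equals the Dataset name (the PVC in this example is named `example-dataset`); the function name `pod_mounts_dataset` is hypothetical.

```shell
# Hypothetical helper: succeed if the Pod spec references a PVC named
# after the Dataset, which is what an injected mount is assumed to look like.
# $1 = Pod name, $2 = Dataset name
pod_mounts_dataset() {
  kubectl get pod "$1" \
    -o jsonpath='{range .spec.volumes[*]}{.persistentVolumeClaim.claimName}{"\n"}{end}' \
    | grep -qx "$2"
}
```

If `pod_mounts_dataset nginx example-dataset` returns non-zero, the Pod was created without the injected volume; the namespace label and the Pod labels are the first things to check.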
Step 4: Clean up
kubectl delete pod nginx
kubectl delete dataset example-dataset
Tear Down
If needed, tear down the Datashim deployment.
plane down datalayer-datashim