Upgrading the CSI Plugin
Because we specify spec.template.spec.priorityClassName = system-cluster-critical, the plugin PODs should get rescheduled even if the server is low on resources. See the Kubernetes documentation on pod priority and preemption for additional information.
As described there, this priority class instructs the scheduler to preempt lower-priority PODs if needed.
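As a quick check, the priority class actually configured on the DaemonSet can be read back with a command like the following (a minimal sketch, assuming the DaemonSet is named lb-csi-node in the kube-system namespace, as in the rest of this guide):

```
# Print the priority class configured on the node DaemonSet's POD template.
kubectl get ds/lb-csi-node -n kube-system -o jsonpath='{.spec.template.spec.priorityClassName}' ; echo
# Expected output: system-cluster-critical
```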
In production deployments, we recommend upgrading nodes manually, to verify that there is no service loss.
Upgrade Overview
Kubernetes supports two update strategies for upgrading these resources:

- OnDelete: Once a POD is deleted, the newly scheduled POD runs with the upgraded specification. With this strategy, we can choose which POD is upgraded and we have more control over the flow.
- RollingUpdate: Once applied, Kubernetes upgrades the DaemonSet PODs one by one on its own, without the ability to intervene if something goes wrong.
The manual approach is preferred, to make sure that there is no service loss while upgrading.
This is the flow we recommend for upgrading the CSI plugin:
- Upgrade the lb-csi-node DaemonSet PODs manually, one by one.
- Verify that the upgraded node is still working.
- Upgrade the lb-csi-controller StatefulSet.
- Verify that the entire cluster is working.
Applying a Manual Upgrade
Manual flow:
- Stage #1: Modify DaemonSet's spec.updateStrategy to OnDelete
- Stage #2: Update DaemonSet lb-csi-plugin image
- Stage #3: Select One Node and Apply Upgrade and Verify
- Stage #4: Verify That the Upgraded POD Is Functioning Properly
- Stage #5: Upgrade the Remaining lb-csi-node PODs
- Stage #6: Modify DaemonSet's spec.updateStrategy back to RollingUpdate
- Stage #7: Upgrade StatefulSet
Stage #1: Modify DaemonSet's spec.updateStrategy to OnDelete
```
kubectl patch ds/lb-csi-node -n kube-system -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
daemonset.apps/lb-csi-node patched

# verify changes applied
kubectl get ds/lb-csi-node -o go-template='{{.spec.updateStrategy.type}}{{"\n"}}' -n kube-system
OnDelete
```

Stage #2: Update DaemonSet lb-csi-plugin Image
The only difference between the two DaemonSets is the lb-csi-plugin image:
```
< image: docker.lightbitslabs.com/lightos-csi/lb-csi-plugin:1.2.0
---
> image: docker.lightbitslabs.com/lightos-csi/lb-csi-plugin:1.4.2
```

In case the discovery-client is deployed as a container in the lb-csi-node POD, we should apply the following difference as well:

```
< image: docker.lightbitslabs.com/lightos-csi/lb-nvme-discovery-client:1.2.0
---
> image: docker.lightbitslabs.com/lightos-csi/lb-nvme-discovery-client:1.4.2
```

The Docker registry prefix could vary between deployments.
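To see which registry prefix your deployment currently uses, you can inspect the image configured on the existing DaemonSet; a minimal sketch:

```
# Print the lb-csi-plugin image currently configured in the lb-csi-node DaemonSet.
kubectl get ds/lb-csi-node -n kube-system -o jsonpath='{.spec.template.spec.containers[?(@.name=="lb-csi-plugin")].image}' ; echo
```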
To update only the container image, use kubectl set image:
```
kubectl set image ds/lb-csi-node -n kube-system lb-csi-plugin=docker.lightbitslabs.com/lightos-csi/lb-csi-plugin:1.4.2
```

In case the discovery-client is deployed as a container in the lb-csi-node POD, run the following command as well:

```
kubectl set image ds/lb-csi-node -n kube-system lb-nvme-discovery-client=docker.lightbitslabs.com/lightos-csi/lb-nvme-discovery-client:1.4.2
```

Stage #3: Select One Node and Apply Upgrade and Verify
We will specify how to manually upgrade the image in each of the PODs:
- List all the lb-csi-plugin PODs in the cluster:
```
kubectl get pods -n kube-system -l app=lb-csi-plugin -o wide
NAME                  READY   STATUS    RESTARTS   AGE     IP               NODE                   NOMINATED NODE   READINESS GATES
lb-csi-controller-0   4/4     Running   0          117m    10.244.3.7       rack06-server63-vm04   <none>           <none>
lb-csi-node-rwrz6     3/3     Running   0          5m10s   192.168.20.61    rack06-server63-vm04   <none>           <none>
lb-csi-node-stzg6     3/3     Running   0          5m      192.168.20.84    rack06-server67-vm03   <none>           <none>
lb-csi-node-wc46m     3/3     Running   0          17h     192.168.16.114   rack09-server69-vm01   <none>           <none>
```

For this example, select the first lb-csi-node POD:
```
NAME                READY   STATUS    RESTARTS   AGE     IP              NODE                   NOMINATED NODE   READINESS GATES
lb-csi-node-rwrz6   3/3     Running   0          5m10s   192.168.20.61   rack06-server63-vm04   <none>           <none>
```

- Delete the POD running on the selected server:
```
kubectl delete pods/lb-csi-node-rwrz6 -n kube-system
pod "lb-csi-node-rwrz6" deleted
```

- Verify that the lb-csi-node POD is upgraded.
Listing the PODs again will show that one of them has a very short AGE and a different name:
```
kubectl get pods -n kube-system -l app=lb-csi-plugin -o wide
NAME                READY   STATUS    RESTARTS   AGE   IP              NODE                   NOMINATED NODE   READINESS GATES
lb-csi-node-g47z2   2/2     Running   0          39s   192.168.20.61   rack06-server63-vm04   <none>           <none>
```

We need to verify that its status is Running.
We should also verify that the image was updated correctly by running the following command:
```
kubectl get pods lb-csi-node-g47z2 -n kube-system -o jsonpath='{.spec.containers[?(@.name=="lb-csi-plugin")].image}' ; echo
docker.lightbitslabs.com/lightos-csi/lb-csi-plugin:1.4.2
```

In case the discovery-client is deployed as a container in the lb-csi-node POD, verify that its image was updated as well with the following command:

```
kubectl get pods lb-csi-node-tpd7d -n kube-system -o jsonpath='{.spec.containers[?(@.name=="lb-nvme-discovery-client")].image}' ; echo
docker.lightbitslabs.com/lightos-csi/lb-nvme-discovery-client:1.4.2
```

Stage #4: Verify that the Upgraded lb-csi-node POD is Functioning Properly
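Before deploying a test workload, you can also confirm that the CSI driver re-registered on the upgraded node. A minimal sketch, using the node name from the example above; the output should include csi.lightbitslabs.com:

```
# List the CSI drivers registered on the node via its CSINode object.
kubectl get csinode rack06-server63-vm04 -o jsonpath='{.spec.drivers[*].name}' ; echo
```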
We will run a simple verification test to see that our node is still functioning before we move to the next node.
By deploying a simple workload on the upgraded node, we can verify that the lb-csi-node POD is functioning properly.
We provide two ways to run the verification test:
- Using Static Manifests
- Using the Provided Helm Chart
Verify the Upgraded Node Using Static Manifests
Our verification test is very simple and has the following steps:
- Create an example PVC.
- Deploy a POD consuming this PVC on the upgraded node.
Create a manifest file named fs-workload.yaml containing the two kinds we want to deploy - PVC and POD:
```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-fs-after-upgrade-pvc
spec:
  storageClassName: "<STORAGE-CLASS-NAME>"
  accessModes:
  - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: "example-fs-after-upgrade-pod"
spec:
  nodeName: "<NODE-NAME>"
  containers:
  - name: busybox-date-container
    imagePullPolicy: IfNotPresent
    image: busybox
    command:
    - "/bin/sh"
    args:
    - "-c"
    - "if [ -f /mnt/test/hostname ] ; then (md5sum -s -c /mnt/test/hostname.md5 && echo OLD MD5 OK || echo BAD OLD MD5) >> /mnt/test/log ; fi ; echo $KUBE_NODE_NAME: $(date +%Y-%m-%d.%H-%M-%S) >| /mnt/test/hostname ; md5sum /mnt/test/hostname >| /mnt/test/hostname.md5 ; echo NEW NODE: $KUBE_NODE_NAME: $(date +%Y-%m-%d.%H-%M-%S) >> /mnt/test/log ; while true ; do date +%Y-%m-%d.%H-%M-%S >| /mnt/test/date ; sleep 10 ; done"
    env:
    - name: KUBE_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    stdin: true
    tty: true
    volumeMounts:
    - name: test-mnt
      mountPath: "/mnt/test"
  volumes:
  - name: test-mnt
    persistentVolumeClaim:
      claimName: "example-fs-after-upgrade-pvc"
```

Make sure you modify the following fields that are cluster-specific:
- storageClassName: The name of the SC configured in your cluster.
- nodeName: The name of the node we want to deploy on.
- Pod.spec.image: The name of the busybox image. Note that the Docker registry prefix could vary between deployments.
To find the name of the upgraded node, run the following command:
```
kubectl get pods -n kube-system -l app=lb-csi-plugin -o wide
NAME                  READY   STATUS    RESTARTS   AGE    IP               NODE                   NOMINATED NODE   READINESS GATES
lb-csi-controller-0   4/4     Running   0          117m   10.244.3.7       rack06-server63-vm04   <none>           <none>
lb-csi-node-rwrz6     3/3     Running   0          17h    192.168.20.61    rack06-server63-vm04   <none>           <none>
lb-csi-node-stzg6     3/3     Running   0          5m     192.168.20.84    rack06-server67-vm03   <none>           <none>
lb-csi-node-wc46m     3/3     Running   0          17h    192.168.16.114   rack09-server69-vm01   <none>           <none>
```

We can see that POD lb-csi-node-stzg6 was the one that restarted and was updated, so we will set nodeName to rack06-server67-vm03.
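Alternatively, instead of editing the manifest by hand, the placeholders can be substituted on the fly and applied in one step; a minimal sketch using the values from this example (lb-sc is an assumed StorageClass name, see the Helm section below):

```
# Fill in the cluster-specific placeholders and create the resources in one step.
sed -e 's/<STORAGE-CLASS-NAME>/lb-sc/' \
    -e 's/<NODE-NAME>/rack06-server67-vm03/' \
    fs-workload.yaml | kubectl create -f -
```

Otherwise, edit the placeholders in the file and continue with the next step.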
Apply the following command:
```
kubectl create -f fs-workload.yaml
```

The workload will write some files to the mounted volume. You can run the following command to see that the content is written to the volume:
```
kubectl exec -it pod/example-fs-after-upgrade-pod -- /bin/sh -c "cat /mnt/test/date ; cat /mnt/test/hostname; cat /mnt/test/hostname.md5"
2021-05-23.08-13-10
rack08-server52: 2021-05-23.08-03-30
61afe45d31f826f5b7e54e6bd92ec07d  /mnt/test/hostname
```

After a successful workload run on the upgraded node, delete the temporary workload by running:
```
kubectl delete -f fs-workload.yaml
```

Verify the Upgraded Node Using Helm
We will use the workload Helm chart provided with the bundle for this. First, list the StorageClasses to get the name we need:
```
kubectl get storageclass
NAME    PROVISIONER             RECLAIMPOLICY   BINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
lb-sc   csi.lightbitslabs.com   Delete          Immediate     false                  2d12h
```

We will use the name of the StorageClass and the name of the upgraded node (rack06-server63-vm04) to deploy the FS POD workload.
```
helm install --set filesystem.enabled=true \
  --set global.storageClass.name=lb-sc \
  --set filesystem.nodeName=rack06-server63-vm04 \
  fs-workload \
  lb-csi-workload-examples
```

Now we need to verify that the PVC is Bound and that the POD is in Ready status.
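Optionally, kubectl wait can block until the POD becomes Ready instead of polling manually; a minimal sketch, assuming the chart creates a POD named example-fs-pod in the current namespace (as in the output below):

```
# Wait up to two minutes for the workload POD to report Ready.
kubectl wait --for=condition=Ready pod/example-fs-pod --timeout=120s
```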
```
kubectl get pv,pvc,pod
NAME                                                        STATUS   CLAIM                    SC      AGE
persistentvolume/pvc-6b26b875-fafd-4abe-95bb-2f5305b61a29   Bound    default/example-fs-pvc   lb-sc   12m

NAME                                   STATUS   VOLUME                                     SC      AGE
persistentvolumeclaim/example-fs-pvc   Bound    pvc-6b26b875-fafd-4abe-95bb-2f5305b61a29   lb-sc   12m

NAME                 READY   STATUS    RESTARTS   AGE
pod/example-fs-pod   1/1     Running   0          12m
```

If all is well, we can assume that the upgrade for that node worked.
Now we will uninstall the workload using the command:
```
helm delete fs-workload
```

Stage #5: Upgrade Remaining lb-csi-node PODs
Repeat the following steps:
- Stage #3: Select One Node and Apply Upgrade and Verify
- Stage #4: Verify that the Upgraded lb-csi-node POD is Functioning Properly
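Once all nodes have been upgraded, the image used by every POD carrying the lb-csi-plugin container can be checked in a single pass; a minimal sketch:

```
# Print each POD name together with its lb-csi-plugin container image.
kubectl get pods -n kube-system -l app=lb-csi-plugin \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="lb-csi-plugin")].image}{"\n"}{end}'
```

All entries should show the target image tag (1.4.2 in this example).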
Stage #6: Modify DaemonSet's spec.updateStrategy back to RollingUpdate
```
kubectl patch ds/lb-csi-node -n kube-system -p '{"spec":{"updateStrategy":{"type":"RollingUpdate"}}}'
daemonset.apps/lb-csi-node patched

# verify changes applied
kubectl get ds/lb-csi-node -o go-template='{{.spec.updateStrategy.type}}{{"\n"}}' -n kube-system
RollingUpdate
```

Stage #7: Upgrade StatefulSet
Since we have only one replica in the lb-csi-controller StatefulSet, there is no need to do a rolling upgrade.
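You can confirm the replica count with a command like:

```
# Print the number of replicas configured for the controller StatefulSet.
kubectl get sts/lb-csi-controller -n kube-system -o jsonpath='{.spec.replicas}' ; echo
# Expected output: 1
```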
Between the two versions discussed here, there were many modifications to the StatefulSet, since snapshot support was added.
Snapshot requires the following resources to be deployed on the Kubernetes cluster:
- Snapshot RBAC: ClusterRole and ClusterRoleBindings.
- Custom resource definitions (version: v0.4.0, apiVersion: apiextensions.k8s.io/v1):
  - kind: VolumeSnapshot
  - kind: VolumeSnapshotClass
  - kind: VolumeSnapshotContent
- Two additional containers in the lb-csi-controller POD:
  - name: snapshot-controller (v4.0.0)
  - name: csi-snapshotter (v4.0.0)
Deploy ClusterRole and ClusterRoleBindings.
We assume that the Kubernetes cluster admin will know what is deployed on the system.
The following steps allow us to validate if we have the Roles and Bindings to work with snapshots.
If resources are not present on the cluster, these steps will guide you as to how to add them.
- Verify if we have ClusterRoles for snapshots:
```
kubectl get clusterrole | grep snap

# If we get an empty response, we will need to deploy the ClusterRoles (see step #3).
# If we get the following output:
external-snapshotter-runner    2d15h
snapshot-controller-runner     2d15h
# it means that the roles are deployed, and the cluster admin needs to make sure that the granted permissions are sufficient.
```

- The same should be done with the ClusterRoleBindings:
```
kubectl get clusterrolebindings | grep snap

# If we get an empty response, we will need to deploy the ClusterRoleBindings (see step #3).
# If we get the following output:
csi-snapshotter-role        2d15h
snapshot-controller-role    2d15h
# it means that the bindings are deployed, and the cluster admin needs to make sure the ClusterRoleBindings are assigned to the correct ServiceAccount.
```

- Deploy ClusterRoles and ClusterRoleBindings using the following command:
```
kubectl create -f snapshot-rbac.yaml
clusterrole.rbac.authorization.k8s.io/snapshot-controller-runner created
clusterrole.rbac.authorization.k8s.io/external-snapshotter-runner created
clusterrolebinding.rbac.authorization.k8s.io/snapshot-controller-role created
clusterrolebinding.rbac.authorization.k8s.io/csi-snapshotter-role created
```

- Deploy the Snapshot CRDs.
We need to understand if we have the snapshot CRDs deployed already on the cluster.
```
kubectl get crd -o jsonpath='{range .items[*]}{@.spec.names.kind}{" , "}{@.apiVersion}{" , "}{@.metadata.annotations.controller-gen\.kubebuilder\.io/version}{"\n"}{end}' ; echo
```

If we see output like this, the CRDs are already deployed on the cluster and we can skip adding them:

```
VolumeSnapshotClass , apiextensions.k8s.io/v1 , v0.4.0
VolumeSnapshotContent , apiextensions.k8s.io/v1 , v0.4.0
VolumeSnapshot , apiextensions.k8s.io/v1 , v0.4.0
```

If we get no output, it means that we do not have the CRDs deployed and we need to deploy them as follows:
```
kubectl create -f snapshot-crds.yaml
customresourcedefinition.apiextensions.k8s.io/volumesnapshotclasses.snapshot.storage.k8s.io created
customresourcedefinition.apiextensions.k8s.io/volumesnapshotcontents.snapshot.storage.k8s.io created
customresourcedefinition.apiextensions.k8s.io/volumesnapshots.snapshot.storage.k8s.io created
```

- Upgrade the lb-csi-controller StatefulSet.
The Docker registry prefix could vary between deployments. Please verify the image prefix before running.
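One quick way to review the image references in the manifest before applying it (assuming the manifest file name used below):

```
# Show every image reference in the StatefulSet manifest so the registry prefix can be verified.
grep 'image:' stateful-set.yaml
```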
```
kubectl apply -f stateful-set.yaml
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
statefulset.apps/lb-csi-controller configured
```

Verify StatefulSet And DaemonSet Version As Expected
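Because this upgrade adds the snapshot-controller and csi-snapshotter containers, the controller POD should now report six containers. A minimal sketch for listing them by name:

```
# List the container names in the upgraded controller POD;
# snapshot-controller and csi-snapshotter should appear among them.
kubectl get pods lb-csi-controller-0 -n kube-system -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{end}'
```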
List all CSI plugin pods:
```
kubectl get pods -n kube-system -l app=lb-csi-plugin
NAME                  READY   STATUS    RESTARTS   AGE
lb-csi-controller-0   6/6     Running   0          3m33s
lb-csi-node-k4bzk     3/3     Running   0          13m
lb-csi-node-pcsmm     3/3     Running   0          13m
lb-csi-node-z7lpr     3/3     Running   0          13m
```

Verify that the version-rel matches the expected version.
For the controller pod:
```
kubectl logs -n kube-system lb-csi-controller-0 -c lb-csi-plugin | grep version-rel
time="2021-03-21T18:50:54.410655+00:00" level=info msg=starting config="{NodeID:rack06-server63-vm04.ctrl Endpoint:unix:///var/lib/csi/sockets/pluginproxy/csi.sock DefaultFS:ext4 LogLevel:debug LogRole:controller LogTimestamps:true LogFormat:text BinaryName: Transport:tcp SquelchPanics:true PrettyJson:false}" driver-name=csi.lightbitslabs.com node=rack06-server63-vm04.ctrl role=controller version-build-id= version-git=v1.4.2-0-gaf08f7e0 version-hash=1.4.2 version-rel=1.4.2
```

The same for each node POD:
```
kubectl logs -n kube-system lb-csi-node-k4bzk -c lb-csi-plugin | grep version-rel
time="2021-03-21T18:41:18.750957+00:00" level=info msg=starting config="{NodeID:rack06-server63-vm04.node Endpoint:unix:///csi/csi.sock DefaultFS:ext4 LogLevel:debug LogRole:node LogTimestamps:true LogFormat:text BinaryName: Transport:tcp SquelchPanics:true PrettyJson:false}" driver-name=csi.lightbitslabs.com node=rack06-server63-vm04.node role=node version-build-id= version-git=v1.4.2-0-gaf08f7e0 version-hash=1.4.2 version-rel=1.4.2
```

Applying RollingUpgrade (Automated Deployment)
Checking DaemonSet Update Strategy
```
kubectl get ds/lb-csi-node -o go-template='{{.spec.updateStrategy.type}}{{"\n"}}' -n kube-system
```

Checking StatefulSet Update Strategy
```
kubectl get sts/lb-csi-controller -o go-template='{{.spec.updateStrategy.type}}{{"\n"}}' -n kube-system
```

Rollout History
Each time we deploy the DaemonSet, a new rollout will be created.
This can be viewed using the following command:
```
kubectl rollout history daemonset lb-csi-node -n kube-system
daemonset.apps/lb-csi-node
REVISION  CHANGE-CAUSE
1         <none>
```

The same can be seen for the StatefulSet resource:
```
kubectl rollout history statefulset lb-csi-controller -n kube-system
statefulset.apps/lb-csi-controller
REVISION
1
2
```

Rollout Status
We can verify the status of a rollout using the following command:
```
kubectl rollout status daemonset lb-csi-node -n kube-system
daemon set "lb-csi-node" successfully rolled out
```

Verify StatefulSet And DaemonSet Version As Expected
List all CSI plugin pods:
```
kubectl get pods -n kube-system -l app=lb-csi-plugin
NAME                  READY   STATUS    RESTARTS   AGE
lb-csi-controller-0   6/6     Running   0          3m33s
lb-csi-node-k4bzk     2/2     Running   0          13m
lb-csi-node-pcsmm     2/2     Running   0          13m
lb-csi-node-z7lpr     2/2     Running   0          13m
```

Verify that the version-rel matches the expected version.
For the controller pod:
```
kubectl logs -n kube-system lb-csi-controller-0 -c lb-csi-plugin | grep version-rel
time="2021-03-21T18:50:54.410655+00:00" level=info msg=starting config="{NodeID:rack06-server63-vm04.ctrl Endpoint:unix:///var/lib/csi/sockets/pluginproxy/csi.sock DefaultFS:ext4 LogLevel:debug LogRole:controller LogTimestamps:true LogFormat:text BinaryName: Transport:tcp SquelchPanics:true PrettyJson:false}" driver-name=csi.lightbitslabs.com node=rack06-server63-vm04.ctrl role=controller version-build-id= version-git=v1.4.2-0-gaf08f7e0 version-hash=1.4.2 version-rel=1.4.2
```

The same for each node POD:
```
kubectl logs -n kube-system lb-csi-node-k4bzk -c lb-csi-plugin | grep version-rel
time="2021-03-21T18:41:18.750957+00:00" level=info msg=starting config="{NodeID:rack06-server63-vm04.node Endpoint:unix:///csi/csi.sock DefaultFS:ext4 LogLevel:debug LogRole:node LogTimestamps:true LogFormat:text BinaryName: Transport:tcp SquelchPanics:true PrettyJson:false}" driver-name=csi.lightbitslabs.com node=rack06-server63-vm04.node role=node version-build-id= version-git=v1.4.2-0-gaf08f7e0 version-hash=1.4.2 version-rel=1.4.2
```

Rollback DaemonSet
If something goes wrong, we can roll back.
```
kubectl rollout undo daemonset lb-csi-node -n kube-system
daemonset.apps/lb-csi-node rolled back
```

Now we can see that the rollout history has changed and that we got a new ControllerRevision (always incrementing):
```
kubectl rollout history daemonset lb-csi-node -n kube-system
daemonset.apps/lb-csi-node
REVISION  CHANGE-CAUSE
2         <none>
3         <none>
```

Rollback StatefulSet
```
kubectl rollout undo statefulset lb-csi-controller -n kube-system
statefulset.apps/lb-csi-controller rolled back

kubectl rollout history statefulset lb-csi-controller -n kube-system
statefulset.apps/lb-csi-controller
REVISION
2
3
```

Verify that the Upgraded Cluster Is Working
Once you have completed all operations for the upgrade, you should run different workloads to verify that all is functioning properly:
- Create a block PVC and POD.
- Create a filesystem PVC and POD.
- Create snapshots, clones, and clone PVCs.
You can use the workload examples provided with the lb-csi-bundle-<version>.tar.gz of the target version.
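For example, the same lb-csi-workload-examples chart used in the Helm verification above can serve as a quick smoke test. The filesystem values below match that earlier example; the release name fs-verify is arbitrary, and the block and snapshot workloads have their own chart values (check the chart's values.yaml for the exact names):

```
# Deploy the filesystem workload example against the upgraded cluster.
helm install --set filesystem.enabled=true \
  --set global.storageClass.name=lb-sc \
  fs-verify \
  lb-csi-workload-examples

# ...verify the PVC is Bound and the POD is Running, then clean up:
helm delete fs-verify
```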