Using TALM to Update Clusters

In this section, we will perform a platform upgrade to both managed clusters using the pre-cache and backup feature implemented in the Topology Aware Lifecycle Manager (TALM) operator. The pre-cache feature prepares the maintenance operation in the managed clusters by pulling the required artifacts prior to the upgrade. The reasoning behind this feature is that SNO spoke clusters may have limited bandwidth to the container registry, which will make it difficult for the upgrade to complete within the required time. In order to ensure the upgrade can fit within the maintenance window, the required artifacts need to be present on the spoke cluster prior to the upgrade. The idea is pre-caching all the images needed for the platform and operator upgrade on the node, so they are not pulled at upgrade time. Do it in a maintenance window(s) before the upgrade maintenance window.

The backup feature, on the other hand, implements a procedure for rapid recovery of a SNO in the event of a failed upgrade that is unrecoverable. The SNO needs to be restored to a working state with the previous version of OCP without requiring a re-provision of the application(s).

The backup feature only allows SNOs to be restored, this is not applicable to any other kind of OpenShift clusters.

Let’s upgrade our both clusters. First of all, let’s verify the TALM operator is running in our hub cluster:

Verifying the TALM state

Below commands must be executed from the workstation host if not specified otherwise.

oc --kubeconfig ~/5g-deployment-lab/hub-kubeconfig get operators

NAME                                                   AGE
advanced-cluster-management.open-cluster-management    16h
lvms-operator.openshift-storage                        16h
multicluster-engine.multicluster-engine                16h
openshift-gitops-operator.openshift-operators          16h
topology-aware-lifecycle-manager.openshift-operators   16h

Next, double check there is no problem with the Pod. Notice that the name of the Pod is cluster-group-upgrade-controller-manager, based on the name of the upstream project Cluster Group Upgrade Operator.

oc --kubeconfig ~/5g-deployment-lab/hub-kubeconfig get pods,sa,deployments -n openshift-operators

NAME                                                             READY   STATUS    RESTARTS   AGE
pod/cluster-group-upgrades-controller-manager-75b967b749-qlzn8   2/2     Running   0          3h3m
pod/gitops-operator-controller-manager-7b6b8967b8-4f8rx          1/1     Running   0          3h6m

NAME                                                                SECRETS   AGE
serviceaccount/builder                                              1         3h17m
serviceaccount/cluster-group-upgrades-controller-manager            1         3h3m
serviceaccount/cluster-group-upgrades-operator-controller-manager   1         3h3m
serviceaccount/default                                              1         3h43m
serviceaccount/deployer                                             1         3h17m
serviceaccount/gitops-operator-controller-manager                   1         3h6m

NAME                                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cluster-group-upgrades-controller-manager   1/1     1            1           3h3m
deployment.apps/gitops-operator-controller-manager          1/1     1            1           3h6m

Finally, let’s take a look at the cluster group upgrade (CGU) CRD managed by TALM. If we pay a closer look we will notice that an already completed CGU was applied to SNO2. As we mentioned in inform policies section, all policies are not enforced, the user has to create the proper CGU resource to enforce them. However, when using ZTP, we want our cluster provisioned and configured automatically. This is where TALM will step through the set of created policies (inform) and will enforce them once the cluster was successfully provisioned. Therefore, the configuration stage starts without any intervention ending up with our OpenShift cluster ready to process workloads.

It’s possible that you get UpgradeNotCompleted, if that’s the case you need to wait for the remaining policies to be applied. You can check policies status here.

oc --kubeconfig ~/5g-deployment-lab/hub-kubeconfig get cgu sno2 -n ztp-install

NAMESPACE     NAME            AGE     STATE       DETAILS
ztp-install   sno2            79m     Completed   All clusters are compliant with all the managed policies

Getting the SNO clusters kubeconfigs

In the previous sections we have deployed the sno2 cluster and attached the sno1 cluster. Before we continue with TALM, let’s grab the kubeconfigs for both cluster since we will need them for the next sections.

oc --kubeconfig ~/5g-deployment-lab/hub-kubeconfig -n sno1 extract secret/sno1-admin-kubeconfig --to=- > ~/5g-deployment-lab/sno1-kubeconfig
oc --kubeconfig ~/5g-deployment-lab/hub-kubeconfig -n sno2 extract secret/sno2-admin-kubeconfig --to=- > ~/5g-deployment-lab/sno2-kubeconfig

Creating the upgrade PGT

Create an upgrade PGT in inform mode, as usual, that will apply and upgrade the SNOs located in Europe (binding rule: du-zone: "europe"), SNO1 and SNO2 clusters. This file needs to be created in the ztp-repository Git repo that we have created.

cat <<EOF > ~/5g-deployment-lab/ztp-repository/site-policies/fleet/active/zone-europe-upgrade-414-1.yaml
---
apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "europe-snos-upgrade"
  namespace: "ztp-policies"
spec:
  bindingRules:
    du-zone: "europe"
    logicalGroup: "active"
  mcp: "master"
  remediationAction: inform
  sourceFiles:
    - fileName: ClusterVersion.yaml
      policyName: "version-414-1"
      metadata:
        name: version
      spec:
        channel: "stable-4.14"
        desiredUpdate:
          force: false
          version: "4.14.1"
          image: "infra.5g-deployment.lab:8443/openshift/release-images@sha256:05ba8e63f8a76e568afe87f182334504a01d47342b6ad5b4c3ff83a2463018bd"
      status:
        history:
          - version: "4.14.1"
            state: "Completed"
EOF

Modify the kustomization.yaml inside the site-policies folder, so it includes this new PGT and eventually will be applied by ArgoCD.

If you’re using MacOS and you’re getting errors while running sed -i commands, make sure you are using gsed. If you do not have it available, please install it: brew install gnu-sed.

sed -i "/- group-du-sno.yaml/a \ \ - zone-europe-upgrade-414-1.yaml" ~/5g-deployment-lab/ztp-repository/site-policies/fleet/active/kustomization.yaml

Then commit all the changes:

cd ~/5g-deployment-lab/ztp-repository/
git add site-policies/fleet/active/zone-europe-upgrade-414-1.yaml site-policies/fleet/active/kustomization.yaml
git commit -m "adds upgrade policy"
git push origin main

Once committed, in a couple of minutes we will see a new policy in the multicloud RHACM console. As noticed, the policy named europe-snos-upgrade-version-414-1 has a violation. However, only one cluster is not compliant. If we check the policy information we will see that this policy is only targeting SNO2 cluster. That’s because the SNO1 cluster does not have the labels zone-europe and active logicalGroup that the PGT is targeting in its binding rule.

Notice in the following picture that there are policies not applying to any cluster. Those are policies targeting the test environment. This is expected since we do not have any clusters in the test environment, e.g, no clusters are labeled with the proper label for testing: logicalGroup=test.

Let’s add the proper labels to the production cluster SNO1:

oc --kubeconfig ~/5g-deployment-lab/hub-kubeconfig label managedcluster sno1 du-zone=europe logicalGroup=active

managedcluster.cluster.open-cluster-management.io/sno1 labeled

Applying the upgrade

At this point, we need to create a Cluster Group Upgrade (CGU) resource that will start the upgrade process. In our case, the process will be divided into two stages:

Run a pre-cache of the new OCP release prior to start the upgrade process.
Before running the upgrade, a backup will be done.

Backup and pre-cache

Let’s create the CGU. In this case, we will apply the managed policy (europe-snos-upgrade-version-414-1) to both clusters at the same time (maxConcurrency is 2). Notice that the CGU is disabled, this is suggested if we are going to run the precaching feature. This means, that once the precaching process is done we are ready to start the upgrade process by enabling the CGU. This idea is related to the compliance of a maintenance window in an enterprise.

Remember that several gigabytes of artifacts needs to be downloaded to the spoke for a full upgrade. SNO spoke clusters may have limited bandwidth to the hub cluster hosting the registry, which will make it difficult for the upgrade to complete within the required time. In order to ensure the upgrade can fit within the maintenance window, the required artifacts need to be present on the spoke cluster prior to the upgrade. Therefore, the process is split up into two stages as mentioned.

In OCP 4.14+, there is a new CRD called PreCachingConfig that will allow us to be more precise on the container images that we need for our cluster to upgrade. We must apply the {talm-precachingconfig-doc}[PreCachingConfig CR] before or concurrently with the CGU to our hub cluster:

You can obtain a more detailed list for excludePrecachePattern for each upgrade by following this KCS.

cat <<EOF | oc --kubeconfig ~/5g-deployment-lab/hub-kubeconfig apply -f -
---
apiVersion: ran.openshift.io/v1alpha1
kind: PreCachingConfig
metadata:
  name: update-europe-snos
  namespace: ztp-policies
spec:
  overrides: {}
  excludePrecachePatterns:
    - agent-installer-
    - alibaba-
    - aws-
    - azure-
    - cloud-
    - gcp-
    - ibmcloud
    - ibm-
    - nutanix-
    - openstack-
    - ovirt-
    - powervs-
    - sdn
    - vsphere-
    - kuryr-
    - csi-
    - hypershift
  additionalImages: []
---
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: update-europe-snos
  namespace: ztp-policies
spec:
  preCaching: true
  preCachingConfigRef:
    name: update-europe-snos
    namespace: ztp-policies
  backup: true
  clusters:
  - sno1
  - sno2
  enable: false
  managedPolicies:
  - europe-snos-upgrade-version-414-1
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240
EOF

Once applied, we can see that the status moved to InProgress with a message detailing that the precaching process is in progress for both SNOs. This means that the first step in our process is executing the pre-cache.

oc --kubeconfig ~/5g-deployment-lab/hub-kubeconfig get cgu -A

NAMESPACE      NAME                 AGE    STATE        DETAILS
ztp-install    local-cluster        3h8m   Completed    All clusters already compliant with the specified managed policies
ztp-install    sno1                 15m    Completed    All clusters already compliant with the specified managed policies
ztp-install    sno2                 86m    Completed    All clusters are compliant with all the managed policies
ztp-policies   update-europe-snos   20s    InProgress   Precaching in progress for 2 clusters

Connecting to any of our spoke clusters we can see a new job being created called pre-cache.

Pre-cache job can take up to 5m to be created.

oc --kubeconfig ~/5g-deployment-lab/sno2-kubeconfig -n openshift-talo-pre-cache get job pre-cache

NAME        COMPLETIONS   DURATION   AGE
pre-cache   0/1           64s        64s

This job creates a Pod that will run the precache process. As we can see below, 183 images need to be downloaded from our local registry to mark the task as successful.

oc --kubeconfig ~/5g-deployment-lab/sno2-kubeconfig logs job/pre-cache -n openshift-talo-pre-cache -f

highThresholdPercent: 85 diskSize:209124332 used:17648032
upgrades.pre-cache 2023-11-21T10:55:10+00:00 DEBUG Release index image processing done
7df5fe3b5fb7352b870735c7d7bd898d0959a9a49558d2ffb42dcd269e01752f
upgrades.pre-cache 2023-11-21T10:55:10+00:00 DEBUG Operators index is not specified. Operators won't be pre-cached
upgrades.pre-cache 2023-11-21T10:55:10+00:00 DEBUG Pulling quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:00162a72c1ae283977f0191ee216e15fe696838b6d7addd8250ff8c5b474cc61 [1/183]
upgrades.pre-cache 2023-11-21T10:55:10+00:00 DEBUG Pulling quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:00eaf204536112ed09a10e4c70f9a8dc6827726bf9bc34f279a9156b881a7a2a [2/183]
upgrades.pre-cache 2023-11-21T10:55:10+00:00 DEBUG Pulling quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:02d61dfc59ac70096dffadda38a52829cd61c9c016e54d2b6d78eb5182d2b19a [3/183]
.
.
.
upgrades.pre-cache 2023-11-21T11:00:48+00:00 DEBUG Pulling quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ff7cfeec16898c293222ac1422841440cdeffefa7d489757e71999d5305425f8 [182/183]
upgrades.pre-cache 2023-11-21T11:00:48+00:00 DEBUG Pulling quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ff88f1a78dac067bb93d98818bcee9bed36de0b2a74f3ed42d15ad816a16f624 [183/183]
upgrades.pre-cache 2023-11-21T11:00:48+00:00 DEBUG Image pre-cache done

Once the precache is done, the CGU state moves to NotEnabled and the Pod running the pre-cache task in both SNO clusters is deleted. At this moment, TALM is waiting for acknowledging the start of the upgrade.

It can take up to 5 minutes for the CGU to report the new state. After that time, the precache objects and the openshift-talo-pre-cache namespace created in the managed clusters are automatically deleted.

oc --kubeconfig ~/5g-deployment-lab/hub-kubeconfig get cgu -n ztp-policies update-europe-snos

NAMESPACE      NAME                 AGE     STATE        DETAILS
ztp-policies   update-europe-snos   29m     NotEnabled   Not enabled

Triggering the upgrade

Now that the pre-cache is done, we can trigger the update. As we said earlier, before the update is actually executed a backup will be done so we can rollback. In order to trigger the update we need to enable the CGU:

oc --kubeconfig ~/5g-deployment-lab/hub-kubeconfig patch cgu update-europe-snos -n ztp-policies --type merge --patch '{"spec":{"enable":true}}'

Notice the CGU state moved to InProgress, which means, the upgrade process has started. In the details you can see that the backup is in progress.

oc --kubeconfig ~/5g-deployment-lab/hub-kubeconfig get cgu -n ztp-policies update-europe-snos

NAMESPACE      NAME                 AGE     STATE        DETAILS
ztp-policies   update-europe-snos   30m     InProgress   Backup in progress for 2 clusters

Connecting to any of our spoke clusters we can see a new job being created called backup-agent.

Backup job can take up to 5m to be created.

oc --kubeconfig ~/5g-deployment-lab/sno1-kubeconfig get jobs -A

NAMESPACE                              NAME                            COMPLETIONS   DURATION   AGE
assisted-installer                     assisted-installer-controller   1/1           26m        21h
openshift-image-registry               image-pruner-28140480           1/1           5s         11h
openshift-operator-lifecycle-manager   collect-profiles-28141110       1/1           5s         34m
openshift-operator-lifecycle-manager   collect-profiles-28141125       1/1           4s         19m
openshift-operator-lifecycle-manager   collect-profiles-28141140       1/1           4s         4m50s
openshift-talo-backup                  backup-agent                    0/1           7s         7s

This job basically runs a Pod that will execute a recovery procedure and will store all required data into the /var/recovery folder of each spoke.

oc --kubeconfig ~/5g-deployment-lab/sno2-kubeconfig logs job/backup-agent -n openshift-talo-backup -f

INFO[0000] ------------------------------------------------------------
INFO[0000] Cleaning up old content...
INFO[0000] ------------------------------------------------------------
INFO[0000] Old directories deleted with contents
INFO[0000] Old contents have been cleaned up
INFO[0000] Available disk space : 154.98 GiB; Estimated disk space required for backup: 276.55 MiB
INFO[0000] Sufficient disk space found to trigger backup
INFO[0000] Upgrade recovery script written
INFO[0000] Running: bash -c /var/recovery/upgrade-recovery.sh --take-backup --dir /var/recovery
INFO[0000] ##### Wed Jul  5 10:57:53 UTC 2023: Taking backup
INFO[0000] ##### Wed Jul  5 10:57:53 UTC 2023: Wiping previous deployments and pinning active
INFO[0000] error: Out of range deployment index 1, expected < 1
INFO[0000] Deployment 0 is now pinned
INFO[0000] ##### Wed Jul  5 10:57:54 UTC 2023: Backing up container cluster and required files
INFO[0000] Certificate /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt is missing. Checking in different directory
INFO[0000] Certificate /etc/kubernetes/static-pod-resources/etcd-certs/configmaps/etcd-serving-ca/ca-bundle.crt found!
INFO[0000] found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-7
INFO[0000] found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-11
INFO[0000] found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-8
INFO[0000] found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-3
INFO[0000] aa91d2b9d4ac8ef60ea81f11643fdaad14717e21a7a4f82b57f59667a02c92af
INFO[0000] etcdctl version: 3.5.6
INFO[0000] API version: 3.5
INFO[0000] {"level":"info","ts":"2023-11-21T10:57:54.421Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/recovery/cluster/snapshot_2023-07-05_105754__POSSIBLY_DIRTY__.db.part"}
INFO[0000] {"level":"info","ts":"2023-11-21T10:57:54.434Z","logger":"client","caller":"v3@v3.5.6/maintenance.go:212","msg":"opened snapshot stream; downloading"}
INFO[0000] {"level":"info","ts":"2023-11-21T10:57:54.434Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://192.168.125.40:2379"}
INFO[0001] {"level":"info","ts":"2023-11-21T10:57:55.204Z","logger":"client","caller":"v3@v3.5.6/maintenance.go:220","msg":"completed snapshot read; closing"}
INFO[0001] {"level":"info","ts":"2023-11-21T10:57:55.290Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://192.168.125.40:2379","size":"84 MB","took":"now"}
INFO[0001] {"level":"info","ts":"2023-11-21T10:57:55.290Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/recovery/cluster/snapshot_2023-07-05_105754__POSSIBLY_DIRTY__.db"}
INFO[0001] Snapshot saved at /var/recovery/cluster/snapshot_2023-07-05_105754__POSSIBLY_DIRTY__.db
INFO[0001] Deprecated: Use `etcdutl snapshot status` instead.
INFO[0001]
INFO[0001] {"hash":1007949646,"revision":43716,"totalKey":10582,"totalSize":83578880}
INFO[0001] snapshot db and kube resources are successfully saved to /var/recovery/cluster
INFO[0002] Command succeeded: cp -Ra /etc/ /var/recovery/etc/
INFO[0002] Command succeeded: cp -Ra /usr/local/ /var/recovery/usrlocal/
INFO[0002] Command succeeded: cp -Ra /var/lib/kubelet/ /var/recovery/kubelet/
INFO[0002] ##### Wed Jul  5 10:57:55 UTC 2023: Backup complete
INFO[0002] ------------------------------------------------------------
INFO[0002] backup has successfully finished ...

Once backups are finished for all clusters, the CGU state will move to BackupCompleted and then quickly move to InProgress:

oc --kubeconfig ~/5g-deployment-lab/hub-kubeconfig get cgu -A

NAMESPACE      NAME                 AGE    STATE             DETAILS
ztp-install    sno2                 4h9m   Completed         All clusters are compliant with all the managed policies
ztp-policies   update-europe-snos   28m    BackupCompleted   Backup is completed for all clusters

At this point, if we connect to any of our spoke clusters we can see that the upgrade process is actually taking place.

oc --kubeconfig ~/5g-deployment-lab/sno2-kubeconfig get clusterversion,nodes

NAME                                         VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
clusterversion.config.openshift.io/version   4.14.0    True        True          30s     Working towards 4.14.1: 101 of 842 done (11% complete), waiting on kube-apiserver

NAME                          STATUS   ROLES                         AGE    VERSION
node/sno2.5g-deployment.lab   Ready    control-plane,master,worker   156m   v1.27.6+f67aeb3

Meanwhile, the clusters are upgrading we can take a look at the multicloud console and see that there is a new policy in enforce mode:

Moving to the Infrastructure → Cluster section of the multicloud console we can also graphically see the upgrading of both clusters:

Finally, our clusters are upgraded:

oc --kubeconfig ~/5g-deployment-lab/sno2-kubeconfig get clusterversion,nodes

NAME                                         VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
clusterversion.config.openshift.io/version   4.14.1    True        False         6m8s    Cluster version is 4.14.1

NAME                          STATUS   ROLES                         AGE     VERSION
node/sno2.5g-deployment.lab   Ready    control-plane,master,worker   3h26m   v1.27.6+f67aeb3

oc --kubeconfig ~/5g-deployment-lab/sno1-kubeconfig get clusterversion,nodes

NAME                                         VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
clusterversion.config.openshift.io/version   4.14.1    True        False         9m12s   Cluster version is 4.14.1

NAME                      STATUS   ROLES                         AGE   VERSION
node/openshift-master-0   Ready    control-plane,master,worker   22h   v1.27.6+f67aeb3

Notice that now the upgrade policy europe-snos-upgrade-version-414-1 is now compliant on both clusters. See that, in order to save resources, the enforce policy is removed once the CGU is successfully applied.

And finally, the CGU will be Completed:

oc --kubeconfig ~/5g-deployment-lab/hub-kubeconfig get cgu -A

NAMESPACE      NAME                 AGE    STATE       DETAILS
ztp-install    local-cluster        23h    Completed   All clusters already compliant with the specified managed policies
ztp-install    sno1                 67m    Completed   All clusters already compliant with the specified managed policies
ztp-install    sno2                 3h5m   Completed   All clusters are compliant with all the managed policies
ztp-policies   update-europe-snos   63m    Completed   All clusters are compliant with all the managed policies