Image-based upgrade for Single-node OpenShift Clusters

Introduction to Image-based Upgrades

In this section, we will perform an image-based upgrade of both managed clusters (sno-abi and sno-ibi) by leveraging the Lifecycle Agent (LCA) and the Topology Aware Lifecycle Manager (TALM) operators. The LCA provides an alternative way to upgrade the platform version of a single-node OpenShift cluster, while TALM manages the rollout of configurations throughout the lifecycle of the cluster fleet.

The image-based upgrade is faster than the standard upgrade method and allows direct upgrades from OpenShift Container Platform <4.y> to <4.y+2>, and <4.y.z> to <4.y.z+n>. These upgrades use an OCI image generated from a dedicated seed cluster. We have already provided a seed image with version 4.18.4. The process for generating this seed image is the same one we followed in Creating the Seed Image, where the version 4.18.3 image used to install sno-ibi was produced.

Image-based upgrades rely on custom images specific to the hardware platform on which the clusters are running. Each distinct hardware platform requires its own seed image. You can find more information here.

Below are the steps to upgrade both managed clusters and their configuration from 4.18.3 to 4.18.4:

  • Create a seed image using the Lifecycle Agent. As mentioned, the seed image infra.5g-deployment.lab:8443/rhsysdeseng/lab5gran:v4.18.4 is already available in the container registry.

  • Verify that all software components meet the required versions. You can find the minimum software requirements here.

  • Install the Lifecycle Agent operator and the OpenShift API for Data Protection (OADP) in the managed clusters.

  • Configure OADP to back up and restore the configuration of the managed clusters that isn’t included in the seed image.

  • Perform the upgrade using the ImageBasedGroupUpgrade CR, which combines the ImageBasedUpgrade (LCA) and ClusterGroupUpgrade APIs (TALM).

Both clusters were installed using different methodologies. While sno-abi was provisioned using the agent-based install workflow, sno-ibi was deployed using the image-based install. However, as long as they both meet the Seed cluster guidelines and have the same combination of hardware, Day 2 Operators, and cluster configuration as the 4.18.4 seed cluster, we’re ready to proceed.

First, let’s verify the software requirements in our hub cluster:

Verifying Hub Software Requirements

Let’s connect to the hub cluster and list all the installed operators:

oc --kubeconfig ~/hub-kubeconfig get operators
NAME                                                   AGE
advanced-cluster-management.open-cluster-management    28h
lvms-operator.openshift-storage                        28h
multicluster-engine.multicluster-engine                28h
openshift-gitops-operator.openshift-operators          28h
topology-aware-lifecycle-manager.openshift-operators   28h
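
The oc get operators output does not include version information. If you need the exact operator versions to compare against the minimum software requirements, listing the cluster service versions is one way to do it (a quick sketch, no specific output assumed):

oc --kubeconfig ~/hub-kubeconfig get csv -A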

Verify that the Image-Based Install (IBI) operator is installed:

oc --kubeconfig ~/hub-kubeconfig get pods -n multicluster-engine -lapp=image-based-install-operator
NAME                                            READY   STATUS    RESTARTS   AGE
image-based-install-operator-7f7659f86c-cd46k   2/2     Running   0          46h

Next, double-check that the TALM operator is running. Note that the name of the Pod is cluster-group-upgrades-controller-manager, which is based on the name of the upstream project, the Cluster Group Upgrade Operator.

oc --kubeconfig ~/hub-kubeconfig get pods -n openshift-operators
NAME                                                             READY   STATUS    RESTARTS   AGE
cluster-group-upgrades-controller-manager-v2-789fd8fbcd-nn4k5    2/2     Running   0          103m
openshift-gitops-operator-controller-manager-6794f4f9cc-vpm2b    2/2     Running   0          106m

Verifying Managed Clusters Requirements

In the previous sections, we deployed the sno-abi cluster using the agent-based installation and the sno-ibi cluster using the image-based installation. Before starting the upgrade, we need to check if the Lifecycle Agent and OpenShift API for Data Protection operators are installed in the target clusters.

We already know that both SNO clusters meet the seed cluster guidelines and have the same combination of hardware, Day 2 Operators, and cluster configuration as the target seed cluster from which the seed image version v4.18.4 was obtained.

Let’s obtain the kubeconfigs for both clusters, as we will need them for the next sections:

oc --kubeconfig ~/hub-kubeconfig -n sno-abi extract secret/sno-abi-admin-kubeconfig --to=- > ~/abi-cluster-kubeconfig
oc --kubeconfig ~/hub-kubeconfig -n sno-ibi extract secret/sno-ibi-admin-kubeconfig --to=- > ~/ibi-cluster-kubeconfig
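
Before moving on, a quick sanity check confirms that both extracted kubeconfigs work as expected (a minimal sketch, no specific output assumed):

oc --kubeconfig ~/abi-cluster-kubeconfig get nodes
oc --kubeconfig ~/ibi-cluster-kubeconfig get nodes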

If we check the operators installed in sno-abi, we’ll notice that neither LCA nor OADP are installed. This is expected because, as we saw in the Crafting Common Policies section, only SR-IOV and LVM operators were installed as Day 2 Operators.

oc --kubeconfig ~/abi-cluster-kubeconfig get csv -A
NAMESPACE                              NAME                                          DISPLAY                   VERSION               REPLACES   PHASE
openshift-operator-lifecycle-manager   packageserver                                 Package Server            0.0.1-snapshot                   Succeeded
openshift-sriov-network-operator       sriov-network-operator.v4.18.0-202502121533   SR-IOV Network Operator   4.18.0-202502121533              Succeeded
openshift-storage                      lvms-operator.v4.18.0                         LVM Storage               4.18.0                           Succeeded

On the other hand, sno-ibi is running all the required operators: Day 2 Operators plus LCA and OADP, which are necessary to run the image-based upgrade process. This is because they were already included in the seed image version v4.18.3. See Running the Seed Image Generation for a list of the operators installed in the seed cluster. They are the same versions because sno-ibi was provisioned with that seed image (image-based installation).

oc --kubeconfig ~/ibi-cluster-kubeconfig get csv -A
NAMESPACE                              NAME                                          DISPLAY                   VERSION               REPLACES   PHASE
openshift-adp                          oadp-operator.v1.4.2                          OADP Operator             1.4.2                            Succeeded
openshift-lifecycle-agent              lifecycle-agent.v4.18.0                       Lifecycle Agent           4.18.0                           Succeeded
openshift-operator-lifecycle-manager   packageserver                                 Package Server            0.0.1-snapshot                   Succeeded
openshift-sriov-network-operator       sriov-network-operator.v4.18.0-202502121533   SR-IOV Network Operator   4.18.0-202502121533              Succeeded
openshift-storage                      lvms-operator.v4.18.0                         LVM Storage               4.18.0                           Succeeded

Okay, let’s install the missing LCA and OADP operators in sno-abi and configure them appropriately in both managed clusters. To achieve this, a PolicyGenTemplate and a cluster group upgrade (CGU) CR will be created in the hub cluster so that TALM manages the installation and configuration process.

Remember that a CGU was already created for sno-abi and has completed. As mentioned in the inform policies section, policies are not enforced automatically; the user has to create the appropriate CGU resource to enforce them. However, when using ZTP, we want our cluster provisioned and configured automatically. This is where TALM steps in: it processes the set of created (inform) policies and enforces them once the cluster has been successfully provisioned. As a result, the configuration stage starts without any intervention, leaving our OpenShift cluster ready to run workloads.

You might encounter an UpgradeNotCompleted error. If that’s the case, you need to wait for the remaining policies to be applied. You can check the policies' status here.
oc --kubeconfig ~/hub-kubeconfig get cgu sno-abi -n ztp-install
NAME   AGE     STATE       DETAILS
sno-abi   7m26s   Completed   All clusters are compliant with all the managed policies

Preparing Managed Clusters for Upgrade

At this stage, we are going to fulfill the image-based upgrade prerequisites using a GitOps approach. This will allow us to easily scale from our two SNOs to a fleet of thousands of SNO clusters if needed. We will achieve this by:

  • Creating a PolicyGenTemplate to install LCA and OADP. This also configures OADP to back up and restore the ACM and LVM setup.

  • Creating a CGU to enforce the policies on the target clusters.

  • Configuring the managed clusters with configurations that were not or could not be included in the seed image.

  • Running the IBGU process.

An S3-compatible storage server is required for backup and restore operations. See the S3 Storage Server section.
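
Before continuing, it can be useful to confirm that the S3 endpoint is reachable. A minimal sketch, assuming the MinIO endpoint that will be referenced later in the DataProtectionApplication (http://192.168.125.1:9002); MinIO exposes a liveness probe that returns HTTP 200 when the server is up:

curl -s -o /dev/null -w '%{http_code}\n' http://192.168.125.1:9002/minio/health/live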

Let’s create a PGT called requirements-upgrade on the hub cluster. This will help us to create multiple RHACM policies, as explained in the PolicyGen DeepDive section.

cat <<EOF > ~/5g-deployment-lab/ztp-repository/site-policies/fleet/active/requirements-upgrade.yaml
---
apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "requirements-upgrade"
  namespace: "ztp-policies"
spec:
  bindingRules:
    common: "ocp418"
    logicalGroup: "active"
    du-zone: europe
  mcp: master
  remediationAction: inform
  sourceFiles:
    - fileName: LcaSubscriptionOperGroup.yaml
      metadata:
        name: lifecycle-agent-operatorgroup
      policyName: subscriptions-policy
    - fileName: LcaSubscription.yaml
      spec:
        channel: stable
        source: redhat-operator-index
      policyName: subscriptions-policy
    - fileName: LcaOperatorStatus.yaml
      policyName: subscriptions-policy
    - fileName: LcaSubscriptionNS.yaml
      policyName: subscriptions-policy
    - fileName: OadpSubscriptionOperGroup.yaml
      policyName: subscriptions-policy
    - fileName: OadpSubscription.yaml
      spec:
        source: redhat-operator-index
      policyName: subscriptions-policy
    - fileName: OadpOperatorStatus.yaml
      policyName: subscriptions-policy
    - fileName: OadpSubscriptionNS.yaml
      policyName: subscriptions-policy
    - fileName: OadpSecret.yaml
      data:
        cloud: W2RlZmF1bHRdCmF3c19hY2Nlc3Nfa2V5X2lkPWFkbWluCmF3c19zZWNyZXRfYWNjZXNzX2tleT1hZG1pbjEyMzQK
      policyName: config-policy
    - fileName: OadpDataProtectionApplication.yaml
      spec:
        backupLocations:
        - velero:
            config:
              insecureSkipTLSVerify: "true"
              profile: default
              region: minio
              s3ForcePathStyle: "true"
              s3Url: http://192.168.125.1:9002
            credential:
              key: cloud
              name: cloud-credentials
            default: true
            objectStorage:
              bucket: '{{hub .ManagedClusterName hub}}'
              prefix: velero
            provider: aws
      policyName: config-policy
    - fileName: OadpBackupStorageLocationStatus.yaml
      policyName: config-policy
    - fileName: ConfigMapGeneric.yaml
      complianceType: mustonlyhave
      policyName: "extra-manifests"
      metadata:
        name: disconnected-ran-config
        namespace: openshift-lifecycle-agent
EOF
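
For reference, the cloud value in the OadpSecret.yaml entry above is simply a base64-encoded, AWS-style credentials file. A sketch of how such a value can be regenerated, assuming the lab's MinIO credentials (admin / admin1234):

cat <<EOF > /tmp/credentials-velero
[default]
aws_access_key_id=admin
aws_secret_access_key=admin1234
EOF
base64 -w0 /tmp/credentials-velero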

Create the openshift-adp namespace on the hub cluster. This is required so that the backup and restore configurations will automatically propagate to the target clusters.

cat <<EOF > ~/5g-deployment-lab/ztp-repository/site-policies/fleet/active/ns-oadp.yaml
---
apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: null
  name: openshift-adp
spec: {}
status: {}
EOF

Include the configuration of a disconnected catalog source as an extra manifest. Remember that catalog sources are not included during the seed image creation.

cat <<EOF > ~/5g-deployment-lab/ztp-repository/site-policies/fleet/active/extra-manifests.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: disconnected-ran-config
  namespace: ztp-policies
data:
  redhat-operator-index.yaml: |
    apiVersion: operators.coreos.com/v1alpha1
    kind: CatalogSource
    metadata:
      annotations:
        target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
      name: redhat-operator-index
      namespace: openshift-marketplace
    spec:
      displayName: default-cat-source
      image: infra.5g-deployment.lab:8443/redhat/redhat-operator-index:v4.18-1742933492
      publisher: Red Hat
      sourceType: grpc
      updateStrategy:
        registryPoll:
          interval: 1h
EOF
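
Optionally, a quick client-side dry run can confirm that the manifest parses correctly before it is committed (a sketch):

oc --kubeconfig ~/hub-kubeconfig apply --dry-run=client -f ~/5g-deployment-lab/ztp-repository/site-policies/fleet/active/extra-manifests.yaml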

Modify the kustomization.yaml inside the site-policies folder so that it includes this new PGT, which will eventually be applied by ArgoCD.

cat <<EOF > ~/5g-deployment-lab/ztp-repository/site-policies/fleet/active/kustomization.yaml
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
generators:
  - common-418.yaml
  - group-du-sno.yaml
  - requirements-upgrade.yaml
#  - group-du-sno-validator.yaml

configMapGenerator:
- files:
  - source-crs/reference-crs/ibu/PlatformBackupRestoreWithIBGU.yaml
  - source-crs/reference-crs/ibu/PlatformBackupRestoreLvms.yaml
  name: oadp-cm
  namespace: openshift-adp
generatorOptions:
  disableNameSuffixHash: true

resources:
  - group-hardware-types-configmap.yaml
  - ns-oadp.yaml
  - extra-manifests.yaml

patches:
- target:
    group: policy.open-cluster-management.io
    version: v1
    kind: Policy
    name: requirements-upgrade-extra-manifests
  patch: |-
    - op: replace
      path: /spec/policy-templates/0/objectDefinition/spec/object-templates/0/objectDefinition/data
      value: '{{hub copyConfigMapData "ztp-policies" "disconnected-ran-config" hub}}'
EOF

Then commit all the changes:

cd ~/5g-deployment-lab/ztp-repository/site-policies
git add *
git commit -m "adds upgrade policy"
git push origin main

ArgoCD will automatically synchronize the new policies and show them as non-compliant in the RHACM WebUI:

IBU policies

However, only one cluster is non-compliant. If we check the policy information, we will see that this policy only targets the sno-abi cluster. This is because the sno-ibi cluster does not have the du-zone=europe label that the PGT targets in its binding rules.
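
The same information is available from the CLI; a sketch that lists the copies of the new policies (at this point, only the root ztp-policies namespace and the sno-abi cluster namespace should contain them):

oc --kubeconfig ~/hub-kubeconfig get policies -A | grep requirements-upgrade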

Let’s add the proper labels to the sno-ibi SNO cluster:

oc --kubeconfig ~/hub-kubeconfig label managedcluster sno-ibi du-zone=europe
IBU policies

At this point, we need to create a Cluster Group Upgrade (CGU) resource that will start the preparation for the upgrade process.

cat <<EOF | oc --kubeconfig ~/hub-kubeconfig apply -f -
---
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: requirements-upgrade
  namespace: ztp-policies
spec:
  clusters:
  - sno-abi
  - sno-ibi
  managedPolicies:
  - requirements-upgrade-subscriptions-policy
  - requirements-upgrade-config-policy
  - requirements-upgrade-extra-manifests
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240
EOF

As explained, the CGU enforces the recently created policies.

IBU policies

We can monitor the remediation process using the command line as well:

oc --kubeconfig ~/hub-kubeconfig get cgu -n ztp-policies
NAMESPACE      NAME                   AGE   STATE        DETAILS
ztp-policies   requirements-upgrade   35s   InProgress   Remediating non-compliant policies

In a few minutes, we will see output similar to the following:

NAMESPACE      NAME                   AGE     STATE       DETAILS
ztp-policies   requirements-upgrade   2m52s   Completed   All clusters are compliant with all the managed policies
IBU policies

Finally, once the upgrade policy has been applied successfully, we can remove it so it is no longer evaluated against our SNO clusters, needlessly consuming resources. Do not forget to commit the changes:

sed -i "s|- requirements-upgrade.yaml|#- requirements-upgrade.yaml|" ~/5g-deployment-lab/ztp-repository/site-policies/fleet/active/kustomization.yaml
cd ~/5g-deployment-lab/ztp-repository/site-policies
git add ~/5g-deployment-lab/ztp-repository/site-policies/fleet/active/kustomization.yaml
git commit -m "Removes upgrade policy once it is applied"
git push origin main

ArgoCD will automatically synchronize the change, and the policies will be removed from the RHACM WebUI.

Creating the Image Based Group Upgrade

The ImageBasedGroupUpgrade (IBGU) CR combines the ImageBasedUpgrade and ClusterGroupUpgrade APIs. It simplifies the upgrade process by using a single resource on the hub cluster—the ImageBasedGroupUpgrade custom resource (CR)—to manage an image-based upgrade on a selected group of managed clusters throughout all stages. A detailed view of the different upgrade stages driven by the LCA operator is explained in the Image Based Upgrades section.

Let’s apply the IBGU and start the image-based upgrade process in both clusters simultaneously:

Note how we can include extra configuration manifests (extraManifests) and backup/restore content (oadpContent) within the same custom resource.
cat <<EOF | oc --kubeconfig ~/hub-kubeconfig apply -f -
---
apiVersion: v1
kind: Secret
metadata:
  name: disconnected-registry-pull-secret
  namespace: default
stringData:
  .dockerconfigjson: '{"auths":{"infra.5g-deployment.lab:8443":{"auth":"YWRtaW46cjNkaDR0MSE="}}}'
type: kubernetes.io/dockerconfigjson
---
apiVersion: lcm.openshift.io/v1alpha1
kind: ImageBasedGroupUpgrade
metadata:
  name: telco5g-lab
  namespace: default
spec:
  clusterLabelSelectors:
    - matchExpressions:
      - key: name
        operator: In
        values:
        - sno-abi
        - sno-ibi
  ibuSpec:
    seedImageRef:
      image: infra.5g-deployment.lab:8443/rhsysdeseng/lab5gran:v4.18.4
      version: 4.18.4
      pullSecretRef:
        name: disconnected-registry-pull-secret
    extraManifests:
      - name: disconnected-ran-config
        namespace: openshift-lifecycle-agent
    oadpContent:
      - name: oadp-cm
        namespace: openshift-adp
  plan:
    - actions: ["Prep", "Upgrade", "FinalizeUpgrade"]
      rolloutStrategy:
        maxConcurrency: 10
        timeout: 2400
EOF

Check that an IBGU object has been created in the hub cluster, along with an associated CGU:

oc --kubeconfig ~/hub-kubeconfig get ibgu,cgu -n default
NAMESPACE   NAME                                                  AGE
default     imagebasedgroupupgrade.lcm.openshift.io/telco5g-lab   102s

NAMESPACE      NAME                                                                              AGE    STATE        DETAILS
default        clustergroupupgrade.ran.openshift.io/telco5g-lab-prep-upgrade-finalizeupgrade-0   102s   InProgress   Rolling out manifestworks

Monitoring the Image Based Group Upgrade

As explained in the Lifecycle Agent Operator (LCA) section, the SNO cluster will progress through these stages:

IBU policies

The progress of the upgrade can be tracked by checking the status field of the IBGU object:

oc --kubeconfig ~/hub-kubeconfig get ibgu -n default telco5g-lab -ojson | jq .status
{
  "clusters": [
    {
      "currentAction": {
        "action": "Prep"
      },
      "name": "sno-abi"
    },
    {
      "currentAction": {
        "action": "Prep"
      },
      "name": "sno-ibi"
    }
  ],
  "conditions": [
    {
      "lastTransitionTime": "2025-02-13T09:27:47Z",
      "message": "Waiting for plan step 0 to be completed",
      "reason": "InProgress",
      "status": "True",
      "type": "Progressing"
    }
  ],
  "observedGeneration": 1
}

The output shows that the upgrade has just started because both sno-abi and sno-ibi are in the Prep stage.

We can also monitor progress by connecting directly to the managed clusters and obtaining the status of the IBU resource:

The Prep stage is the current stage, as indicated by the Desired Stage field. The Details field provides extra information; in this case, it shows that the stateroot setup job, which prepares the new stateroot from the seed image, is in progress.
During the Prep stage, the clusters will start pulling images. This puts some I/O pressure on the hypervisor, so you may lose connectivity to the clusters' APIs several times over a period of about 10 minutes.
oc --kubeconfig ~/abi-cluster-kubeconfig get ibu
NAME      AGE   DESIRED STAGE   STATE        DETAILS
upgrade   36s   Prep            InProgress   Stateroot setup job in progress. job-name: lca-prep-stateroot-setup, job-namespace: openshift-lifecycle-agent
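
For more detail on what the Prep stage is doing, you can inspect the job referenced in the Details column (a sketch; the job name and namespace are taken from the output above, and the job may already have completed by the time you look):

oc --kubeconfig ~/abi-cluster-kubeconfig get jobs -n openshift-lifecycle-agent
oc --kubeconfig ~/abi-cluster-kubeconfig logs -n openshift-lifecycle-agent job/lca-prep-stateroot-setup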

Once the Prep stage is complete, the IBGU object will automatically move to the Upgrade stage. The output below shows that both SNO clusters have completed the Prep stage (completedActions) and their currentAction is Upgrade.

oc --kubeconfig ~/hub-kubeconfig get ibgu -n default telco5g-lab -ojson | jq .status
{
  "clusters": [
    {
      "completedActions": [
        {
          "action": "Prep"
        }
      ],
      "currentAction": {
        "action": "Upgrade"
      },
      "name": "sno-abi"
    },
    {
      "completedActions": [
        {
          "action": "Prep"
        }
      ],
      "currentAction": {
        "action": "Upgrade"
      },
      "name": "sno-ibi"
    }
  ],
  "conditions": [
    {
      "lastTransitionTime": "2025-03-05T11:50:26Z",
      "message": "Waiting for plan step 0 to be completed",
      "reason": "InProgress",
      "status": "True",
      "type": "Progressing"
    }
  ],
  "observedGeneration": 1
}
At some point during the Upgrade stage, the SNO clusters will reboot into the new stateroot. Depending on the host, it can take several minutes for the Kubernetes API to become available again.
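
If you want to know as soon as a cluster's API is reachable again after the reboot, a simple polling loop is enough (a minimal sketch using the sno-abi kubeconfig):

until oc --kubeconfig ~/abi-cluster-kubeconfig get nodes &>/dev/null; do
  echo "API not reachable yet, retrying in 30 seconds..."
  sleep 30
done
oc --kubeconfig ~/abi-cluster-kubeconfig get nodes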

After some time, the SNO clusters will move to the FinalizeUpgrade stage. The output below shows that sno-ibi has moved to the FinalizeUpgrade stage, while sno-abi is still in the previous stage (Upgrade).

oc --kubeconfig ~/hub-kubeconfig get ibgu -n default telco5g-lab -ojson | jq .status
{
  "clusters": [
    {
      "completedActions": [
        {
          "action": "Prep"
        }
      ],
      "currentAction": {
        "action": "Upgrade"
      },
      "name": "sno-abi"
    },
    {
      "completedActions": [
        {
          "action": "Prep"
        },
        {
          "action": "Upgrade"
        }
      ],
      "currentAction": {
        "action": "FinalizeUpgrade"
      },
      "name": "sno-ibi"
    }
  ],
  "conditions": [
    {
      "lastTransitionTime": "2025-03-05T11:50:26Z",
      "message": "Waiting for plan step 0 to be completed",
      "reason": "InProgress",
      "status": "True",
      "type": "Progressing"
    }
  ],
  "observedGeneration": 1
}

At this point, the SNO clusters are restoring the data that was backed up. The backup and restore CRs were provided through the oadp-cm ConfigMap referenced in the oadpContent field of the IBGU (see the Preparing Managed Clusters for Upgrade section). Connecting to sno-ibi during this stage shows the restore in progress.

oc --kubeconfig ~/ibi-cluster-kubeconfig get ibu
NAME      AGE     DESIRED STAGE   STATE        DETAILS
upgrade   8m46s   Upgrade         InProgress   Restore of Application Data is in progress
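
The underlying OADP/Velero resources can also be inspected directly on the managed cluster (a sketch, assuming the Backup and Restore CRs are created in the openshift-adp namespace configured earlier):

oc --kubeconfig ~/ibi-cluster-kubeconfig get backup,restore -n openshift-adp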

Finally, both SNO clusters have completed all the stages, including FinalizeUpgrade:

oc --kubeconfig ~/hub-kubeconfig get ibgu -n default telco5g-lab -ojson | jq .status
{
  "clusters": [
    {
      "completedActions": [
        {
          "action": "Prep"
        },
        {
          "action": "Upgrade"
        },
        {
          "action": "FinalizeUpgrade"
        }
      ],
      "name": "sno-abi"
    },
    {
      "completedActions": [
        {
          "action": "Prep"
        },
        {
          "action": "Upgrade"
        },
        {
          "action": "FinalizeUpgrade"
        }
      ],
      "name": "sno-ibi"
    }
  ],
  "conditions": [
    {
      "lastTransitionTime": "2025-02-13T09:47:57Z",
      "message": "All plan steps are completed",
      "reason": "Completed",
      "status": "False",
      "type": "Progressing"
    }
  ],
  "observedGeneration": 1
}

We can also check that the CGU associated with the IBGU shows as Completed:

oc --kubeconfig ~/hub-kubeconfig get ibgu,cgu -n default
NAMESPACE   NAME                                                  AGE
default     imagebasedgroupupgrade.lcm.openshift.io/telco5g-lab   22m

NAMESPACE      NAME                                                                              AGE     STATE       DETAILS
default        clustergroupupgrade.ran.openshift.io/telco5g-lab-prep-upgrade-finalizeupgrade-0   22m     Completed   All manifestworks rolled out successfully on all clusters

Note that once the upgrade is finished, the IBU CR from both clusters moves back to the initial state: Idle.

oc --kubeconfig ~/ibi-cluster-kubeconfig get ibu
NAME      AGE     DESIRED STAGE   STATE   DETAILS
upgrade   9m37s   Idle            Idle    Idle
Note the time it took to update the SNO clusters. As shown above, sno-ibi took less than 10 minutes to complete a z-stream update. Compared with a regular update time of around 40 minutes, the reduction in upgrade time is significant.

As a final verification, check that both upgraded clusters comply with the Telco RAN DU reference configuration and that the OpenShift version shown in the Infrastructure tab is the expected one:

IBU policies
IBU OCP version
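
The version can also be verified from the CLI on each managed cluster; the VERSION column of the ClusterVersion resource should report 4.18.4 (a quick sketch):

oc --kubeconfig ~/abi-cluster-kubeconfig get clusterversion
oc --kubeconfig ~/ibi-cluster-kubeconfig get clusterversion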