Topology Aware Lifecycle Manager (TALM)
Leveraging TALM, we can manage the lifecycle of thousands of clusters at scale in a constrained operational environment. This capability is essential when working in telecom (e.g., RAN) or other environments where service level agreements require managed updates to a fleet of clusters.
Remember that ZTP generates installation and configuration resources from manifests (ClusterInstances and PolicyGenerators) stored in Git. These artifacts are applied to a centralized hub cluster where a combination of OpenShift GitOps, Red Hat Advanced Cluster Management, the Assisted Service deployed by the Infrastructure Operator, and the Topology Aware Lifecycle Manager (TALM) use them to install and configure the clusters. See the ZTP components chapter.
TALM is responsible for managing the rollout of configuration throughout the lifecycle of the fleet of clusters. During initial deployment, the configuration phase of the ZTP process depends on TALM to orchestrate the application of the configuration custom resources to the cluster. When day-2 configuration changes need to be rolled out to the fleet, TALM manages that rollout in progressive, limited-size batches of clusters. When upgrades to OpenShift or the day-2 operators are needed, TALM progressively rolls those out as well. There are several key integration points between GitOps, RHACM Policy, and TALM.
Default inform policies
In this solution, as mentioned in the Managing at scale section, all policies are created with a remediation action of "inform". With this remediation action, Red Hat ACM observes the compliance state of the policy and reports the status to the user, but does not take any action to apply the desired configuration. When the administrator of the fleet is ready to bring the clusters into compliance with the policy, TALM provides the tools and configurability to progressively remediate the fleet.
TALM steps through the set of created policies and switches them to "enforce" in order to push the configuration to the spoke clusters. This strategy ensures that the initial deployment (ZTP) integrates seamlessly with future configuration changes, without the risk of rolling those changes out to all spoke clusters in the network simultaneously.
| TALM enables us to select when and on which clusters the configuration is applied. |
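For reference, the sketch below shows roughly what an inform policy generated by ZTP can look like. It is heavily trimmed, and the policy name, namespace, and object-templates content are hypothetical placeholders rather than an exact rendering of a generated policy.

apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: group-du-sno-config-policy    # hypothetical name
  namespace: ztp-group                # hypothetical namespace
spec:
  remediationAction: inform           # RHACM only reports compliance; TALM later enforces it
  disabled: false
  policy-templates:
    - objectDefinition:
        apiVersion: policy.open-cluster-management.io/v1
        kind: ConfigurationPolicy
        metadata:
          name: group-du-sno-config-policy-config
        spec:
          remediationAction: inform
          severity: low
          object-templates: []        # the desired configuration CRs would be listed here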
Cluster Group Upgrade (CGU)
The Topology Aware Lifecycle Manager (TALM) builds the remediation plan from the ClusterGroupUpgrade (CGU) custom resource for a group of clusters. You can define the following specifications in a ClusterGroupUpgrade CR. Note that the example below is a CGU CR that enforces an OpenShift release upgrade previously defined in a policy called du-upgrade-platform-upgrade, which is listed in the managedPolicies spec.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: ocp-upgrade
  namespace: ztp-group-du-sno
spec:
  preCaching: true                # precache enabled before upgrade
  backup: true                    # backup enabled before upgrade
  deleteObjectsOnCompletion: false
  clusters:                       # Clusters in the group
    - snonode-virt02
    - snonode-virt01
    - snonode-virt03
  enable: true
  managedPolicies:                # Applicable list of managed policies
    - du-upgrade-platform-upgrade
  remediationStrategy:
    canaries:
      - snonode-virt01            # Defines the clusters for canary updates
    maxConcurrency: 2             # Defines the maximum number of concurrent updates in a batch
    timeout: 240
| TALM does not only deal with OpenShift release upgrades; when we talk about upgrades in this section, we mean all kinds of modifications to the managed clusters. Essentially, a CGU must be created every time we want to enforce a policy. |
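As an illustration of that point, a day-2 configuration change could be rolled out with a CGU along these lines. The CGU name, policy name, and cluster list are hypothetical and only intended to show the shape of the resource.

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: day2-config-change          # hypothetical CGU name
  namespace: ztp-group-du-sno
spec:
  clusters:
    - snonode-virt01
    - snonode-virt02
  enable: true
  managedPolicies:
    - group-du-sno-config-policy    # any inform policy bound to these clusters
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240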
Auto creation of CGU
So far we have been talking about using TALM for common day-2 operations. But if a CGU must be created in order to enforce a previously applied policy, how is it possible to run a fully automated ZTP workflow on the target clusters, e.g. configure them and even run our CNF workloads in a streamlined way?
The answer is that TALM integrates with ZTP for newly created clusters. When we define one or multiple managed clusters in telco, we do not only want to provision them; we also want to apply a specific configuration, such as the validated 5G RAN DU profile. Often, we also want our workloads to run on top of them. ZTP is envisioned as a streamlined process where, as developers, we only push manifest definitions to the appropriate Git repository.
So, once clusters are provisioned, TALM monitors their state by checking the ManagedCluster CRs on the hub cluster. Any ManagedCluster CR which does not have a "ztp-done" label applied, including newly created ManagedCluster CRs, will cause TALM to automatically create a ClusterGroupUpgrade CR with the following characteristics:
- It is created in the ztp-install namespace.
- It has the same name as the ManagedCluster CR, usually the name of the cluster.
- The cluster selector includes only the cluster associated with that ManagedCluster CR.
- The set of managedPolicies includes ALL policies that RHACM has bound to the cluster at the time the CGU is created.
- It is enabled.
- Precaching is disabled.
- Timeout set to 4 hours (240 minutes).
| TALM will basically enforce all existing policies that are bound to a cluster that does not have the "ztp-done" label. This is done by automatically creating an appropriate CGU CR. |
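Putting those characteristics together, an auto-created CGU might look roughly like the following sketch; the cluster and policy names are hypothetical.

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: snonode-virt01              # same name as the ManagedCluster CR
  namespace: ztp-install
spec:
  clusters:
    - snonode-virt01                # only the newly created cluster
  enable: true
  preCaching: false
  managedPolicies:                  # all policies bound to the cluster, ordered by wave
    - common-subscriptions-policy
    - group-du-sno-config-policy
    - site-specific-config-policy
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240                    # 4 hours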
Phase labels
The ClusterGroupUpgrade CR that is auto-created for ZTP includes directives to label the ManagedCluster CR at the start and end of the ZTP process. When ZTP configuration (post-installation) commences, the ManagedCluster has the ztp-running label applied. When all policies are remediated on the cluster (fully compliant), these directives cause TALM to remove the ztp-running label and apply the ztp-done label.
In essence, the ztp-done label will be applied when the cluster is fully ready for deployment of applications.
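As an illustration, once ZTP finishes, the ManagedCluster CR ends up carrying the ztp-done label, roughly as in the fragment below (the cluster name is hypothetical).

apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: snonode-virt01              # hypothetical cluster name
  labels:
    ztp-done: ""                    # applied by TALM once all policies are compliant
    # ztp-running: ""               # present instead while post-installation configuration is ongoing
spec:
  hubAcceptsClient: true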
Waves
A ZTP wave is basically an annotation (ran.openshift.io/ztp-deploy-wave) included on each policy template that allows TALM to apply policies in an ordered manner. As an example, every policy generated from a PolicyGenerator by ZTP incorporates a wave number, such as the one in SriovSubscriptionNS.yaml:
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-sriov-network-operator
  annotations:
    workload.openshift.io/allowed: management
    ran.openshift.io/ztp-deploy-wave: "2"
| Detailed information on policy templates can be found in the following PolicyGen deepdive section. |
When TALM builds the auto-generated CGU as part of the ZTP phase, the included policies will be ordered according to these wave annotations. This is the only time that the wave annotations have any impact on TALM behavior.
As TALM works through the list of managedPolicies included in the CGU, it waits for each policy to become compliant before moving to the next one. It is important to ensure that the order in which policies are listed under managedPolicies (and thus also the wave annotations used to populate this list during ZTP) takes into account any prerequisites for the CRs in those policies to be applied to the cluster. For example, an operator must be installed before, or concurrently with, its configuration.
| All CRs in the same policy must have the same setting for the ztp-deploy-wave annotation. The default value of this annotation for each CR can be overridden in the PolicyGenerator. The default ztp-deploy-wave value for each CR can be found in the corresponding source CRs. |
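For example, assuming the override is expressed as a patch on the source CR inside the PolicyGenerator (the generator name, policy name, and namespace below are hypothetical), it could look something like this sketch:

apiVersion: policy.open-cluster-management.io/v1
kind: PolicyGenerator
metadata:
  name: group-du-sno                  # hypothetical generator name
policyDefaults:
  namespace: ztp-group                # hypothetical namespace
policies:
  - name: group-du-sno-sriov-policy   # hypothetical policy name
    manifests:
      - path: source-crs/SriovSubscriptionNS.yaml
        patches:
          - metadata:
              annotations:
                ran.openshift.io/ztp-deploy-wave: "10"  # overrides the default wave "2"

Remember that any other CR grouped into the same policy would need the same wave value for the override to be valid.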
SNO Upgrades at Scale
TALM helps with the orchestration of SNO cluster upgrades at scale. The recommended upgrade method for SNOs is the Image Based Upgrade (IBU), but as described in the LCA section, the Lifecycle Agent works at the single-cluster level. TALM runs on the hub cluster and can handle the upgrade of multiple clusters. In order to orchestrate such upgrades, the ImageBasedGroupUpgrade API can be used. We will see how to leverage this API in the hands-on section of the lab. You can read more about Image Based Group Upgrades in the official docs.
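For orientation, an ImageBasedGroupUpgrade CR follows roughly the shape sketched below. The name, label selector, seed image, versions, and timings are hypothetical placeholders, so check the official docs for the exact fields before using it.

apiVersion: lcm.openshift.io/v1alpha1
kind: ImageBasedGroupUpgrade
metadata:
  name: ibu-4-16                      # hypothetical name
  namespace: default
spec:
  clusterLabelSelectors:              # which SNO clusters to upgrade
    - matchLabels:
        group-du-sno: ""
  ibuSpec:                            # mirrors the per-cluster ImageBasedUpgrade spec
    seedImageRef:
      image: quay.io/example/seed-image:4.16.0   # hypothetical seed image
      version: 4.16.0
  plan:                               # ordered stages rolled out across the fleet
    - actions: ["Prep"]
      rolloutStrategy:
        maxConcurrency: 2
        timeout: 240
    - actions: ["Upgrade", "FinalizeUpgrade"]
      rolloutStrategy:
        maxConcurrency: 1
        timeout: 240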