Troubleshooting Tips

In this section we provide some useful tips for troubleshooting issues that may arise during the execution of this lab.

Verification of the lab status

Containerized Infrastructure Services

There are several infrastructure services that run as containers, such as the Git server or the container registry. Below is the list of containers that should be running.

All these containers are managed via the following systemd services:

  • podman-gitea.service

  • podman-minio.service

  • podman-registry.service

  • podman-showroom-apache.service

  • podman-showroom-firefox.service

  • podman-showroom-traefik.service

  • podman-showroom-wetty.service

  • podman-webcache.service

sudo podman ps
CONTAINER ID  IMAGE                                             COMMAND               CREATED       STATUS       PORTS                                                                                     NAMES
22b7da48762d  quay.io/alosadag/httpd:p8080                      httpd-foreground      11 hours ago  Up 11 hours                                                                                            webcache
3ca3e6d08b86  quay.io/mavazque/registry:2.7.1                   /etc/docker/regis...  11 hours ago  Up 11 hours                                                                                            registry
afa55e6fec53  quay.io/minio/minio:RELEASE.2025-02-07T23-21-09Z  server /data --co...  11 hours ago  Up 11 hours  0.0.0.0:9001->9001/tcp, 0.0.0.0:9002->9000/tcp                                            minio-server
0a044c32bb1d  quay.io/mavazque/gitea:1.17.3                     /bin/s6-svscan /e...  11 hours ago  Up 11 hours  0.0.0.0:2222->22/tcp, 0.0.0.0:3000->3000/tcp                                              gitea
0fd211cc1bb8  quay.io/fedora/httpd-24-micro:2.4                 /usr/bin/run-http...  11 hours ago  Up 11 hours  192.168.125.1:8181->8080/tcp                                                              showroom-apache
337206f80582  quay.io/rhsysdeseng/showroom:wetty                --ssh-user=lab-us...  11 hours ago  Up 11 hours  192.168.125.1:3001->3000/tcp                                                              showroom-wetty
687cb4e59f6a  quay.io/rhsysdeseng/showroom:traefik-v3.3.4       --configFile=/etc...  11 hours ago  Up 11 hours  0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp, 0.0.0.0:3003->3003/tcp, 0.0.0.0:8888->8888/tcp  showroom-traefik
a0cec2b7b429  quay.io/rhsysdeseng/showroom:webfirefox                                 11 hours ago  Up 11 hours                                                                                            showroom-firefox
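
All of these containers are managed by systemd, so if any of them is missing from the output above, you can inspect and restart the corresponding unit. A minimal sketch, using the podman-gitea.service unit as an example:

sudo systemctl status podman-gitea.service
# Restart the unit if the container is stopped or unhealthy
sudo systemctl restart podman-gitea.service
# Review the unit logs if it keeps failing
sudo journalctl -u podman-gitea.service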

SNO Virtual Machines

In this lab we are going to provision and configure two SNO clusters named sno-abi and sno-ibi. Let’s double-check that their virtual machines exist and are stopped.

kcli list vm
+----------------+--------+----------------+----------------------------------------------------+------+---------+
|      Name      | Status |       Ip       |                       Source                       | Plan | Profile |
+----------------+--------+----------------+----------------------------------------------------+------+---------+
| hub-ctlplane-0 |   up   | 192.168.125.20 | rhcos-418.94.202501221327-0-openstack.x86_64.qcow2 | hub  |  kvirt  |
| hub-ctlplane-1 |   up   | 192.168.125.21 | rhcos-418.94.202501221327-0-openstack.x86_64.qcow2 | hub  |  kvirt  |
| hub-ctlplane-2 |   up   | 192.168.125.22 | rhcos-418.94.202501221327-0-openstack.x86_64.qcow2 | hub  |  kvirt  |
|    sno-abi     |  down  |                |                                                    | hub  |  kvirt  |
|    sno-ibi     |  down  |                |                                                    | hub  |  kvirt  |
|    sno-seed    |   up   | 192.168.125.30 |                                                    | hub  |  kvirt  |
+----------------+--------+----------------+----------------------------------------------------+------+---------+
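
If one of the VMs is not in the expected state, kcli can be used to inspect and manage it. For example, using sno-abi as the target (note that the SNO VMs are expected to be down at this point, so only start them if the lab instructions call for it):

kcli info vm sno-abi
kcli start vm sno-abi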

Hub cluster

Before working with oc commands, you can enable command auto-completion by running:

source <(oc completion bash)
# Make it persistent
oc completion bash >> /etc/bash_completion.d/oc_completion
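
Note that the lab user may not have write access to /etc/bash_completion.d. If the second command fails with a permission error, a variant using sudo tee should work:

oc completion bash | sudo tee /etc/bash_completion.d/oc_completion > /dev/null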

Check the status of the hub cluster.

oc --kubeconfig ~/hub-kubeconfig get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   v4.19.0   True        False         22h     Cluster version is v4.19.0
oc --kubeconfig ~/hub-kubeconfig get nodes
NAME                               STATUS   ROLES                         AGE   VERSION
hub-ctlplane-0.5g-deployment.lab   Ready    control-plane,master,worker   10h   v1.32.5
hub-ctlplane-1.5g-deployment.lab   Ready    control-plane,master,worker   10h   v1.32.5
hub-ctlplane-2.5g-deployment.lab   Ready    control-plane,master,worker   10h   v1.32.5
oc --kubeconfig ~/hub-kubeconfig get operators
NAME                                                   AGE
advanced-cluster-management.open-cluster-management    10h
lvms-operator.openshift-storage                        10h
multicluster-engine.multicluster-engine                10h
openshift-gitops-operator.openshift-operators          10h
topology-aware-lifecycle-manager.openshift-operators   10h
oc --kubeconfig ~/hub-kubeconfig get catalogsources -A
NAMESPACE               NAME                                    DISPLAY   TYPE   PUBLISHER   AGE
openshift-marketplace   cs-redhat-operator-index-v4-18-174293             grpc               10h
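
If any of these checks looks wrong, the cluster operators view usually points at the failing component; degraded or unavailable operators are a good starting point for further debugging:

oc --kubeconfig ~/hub-kubeconfig get clusteroperators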

DNS resolution

Verify that the OpenShift API and the apps domain (wildcard) can be resolved.

The dig command is not part of the standard Linux utilities (you may need to install it); on RHEL-based systems it is provided by the bind-utils package.
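
If dig is missing, you can install it as follows (assuming a RHEL-based system):

sudo dnf install -y bind-utils
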
dig +short api.hub.5g-deployment.lab
192.168.125.10
dig +short oauth-openshift.apps.hub.5g-deployment.lab
192.168.125.11
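
Since *.apps is a wildcard record, any name under the apps domain should resolve to the same address. For example (test is an arbitrary, hypothetical name used only to exercise the wildcard):

dig +short test.apps.hub.5g-deployment.lab
192.168.125.11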

Policies not showing in the Governance console

In cases where the policies are not shown in the Governance section of the Multicloud console, we first have to check whether the policies Argo CD application was synced successfully.
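
A quick way to check the sync status from the CLI is to list the Argo CD applications (assuming they live in the openshift-gitops namespace, the default for the OpenShift GitOps operator):

oc --kubeconfig ~/hub-kubeconfig get applications.argoproj.io -n openshift-gitops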

Verify that the policies in the hub cluster are similar to the ones shown below. Remember that inform as the remediation action is correct at this stage.

oc --kubeconfig ~/hub-kubeconfig get policies -A
NAMESPACE      NAME                                       REMEDIATION ACTION   COMPLIANCE STATE   AGE
sno-abi        ztp-policies.common-config-policy          inform               Compliant          116m
sno-abi        ztp-policies.common-subscriptions-policy   inform               Compliant          116m
sno-abi        ztp-policies.du-sno-group-policy           inform               Compliant          116m
sno-abi        ztp-policies.du-sno-sites-sites-policy     inform               Compliant          116m
sno-ibi        ztp-policies.common-config-policy          inform               Compliant          92m
sno-ibi        ztp-policies.common-subscriptions-policy   inform               Compliant          92m
sno-ibi        ztp-policies.du-sno-group-policy           inform               Compliant          92m
sno-ibi        ztp-policies.du-sno-sites-sites-policy     inform               Compliant          92m
ztp-policies   common-config-policy                       inform               Compliant          117m
ztp-policies   common-subscriptions-policy                inform               Compliant          117m
ztp-policies   common-test-config-policy                  inform                                  117m
ztp-policies   common-test-subscriptions-policy           inform                                  117m
ztp-policies   du-sno-group-policy                        inform               Compliant          117m
ztp-policies   du-sno-sites-sites-policy                  inform               Compliant          117m
ztp-policies   du-sno-test-group-policy                   inform                                  117m

Policies not applied

Policies can fail to apply for multiple reasons. First, let’s check that the policies are shown in the Governance console.

If the policies show a warning message in the Cluster violations section, it is because the SNO servers are still being provisioned. You can double-check the status of the provisioning in the Infrastructure → Clusters section. Verify that the ztp-running label has not been added yet; you can check this from the console or from the CLI, as shown below.
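
To check the labels from the CLI, list the managed clusters and look for the ztp-running label:

oc --kubeconfig ~/hub-kubeconfig get managedclusters --show-labels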

In cases where the Governance console shows policies already assigned to SNO clusters, we should check the status of the TALM operator. Remember that it is responsible for moving the policies from inform to enforce so that they are eventually applied. Check the status of the cluster-group-upgrades-controller-manager Pod and its logs:

oc --kubeconfig ~/hub-kubeconfig get pods -n openshift-operators
NAME                                                            READY   STATUS    RESTARTS   AGE
cluster-group-upgrades-controller-manager-v2-6fcb8695bf-bvngg   2/2     Running   0          10h
openshift-gitops-operator-controller-manager-65b8984b8c-2qp8j   2/2     Running   0          10h
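
If the Pod is running but policies are not progressing, its logs are the best source of information. A minimal sketch, using the Pod name from the output above (--all-containers avoids having to know the container names):

oc --kubeconfig ~/hub-kubeconfig logs -n openshift-operators cluster-group-upgrades-controller-manager-v2-6fcb8695bf-bvngg --all-containers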

Next, we can verify that a ClusterGroupUpgrade CR was created automatically by the TALM operator. If it was not created, it means that either the label is not set on the cluster yet or the operator is having issues. In the latter case, check the logs as explained previously.

oc --kubeconfig ~/hub-kubeconfig get cgu -A
NAMESPACE      NAME                                         AGE   STATE       DETAILS
ztp-install    sno-abi                                      73m   Completed   All clusters are compliant with all the managed policies
ztp-install    sno-ibi                                      66m   Completed   All clusters are compliant with all the managed policies
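
While a CGU is still in progress, you can watch its state transitions in real time with the -w (watch) flag:

oc --kubeconfig ~/hub-kubeconfig get cgu -n ztp-install -w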

Inspecting the CGU YAML shows a lot of information about the current status of the configuration:

oc --kubeconfig ~/hub-kubeconfig get cgu -n ztp-install sno-abi -o yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
... REDACTED ...
status:
  clusters:
  - name: sno-abi
    state: complete
  computedMaxConcurrency: 1
  conditions:
  - lastTransitionTime: "2025-04-01T19:08:47Z"
    message: All selected clusters are valid
    reason: ClusterSelectionCompleted
    status: "True"
    type: ClustersSelected
  - lastTransitionTime: "2025-04-01T19:08:47Z"
    message: Completed validation
    reason: ValidationCompleted
    status: "True"
    type: Validated
  - lastTransitionTime: "2025-04-01T19:10:35Z"
    message: All clusters are compliant with all the managed policies
    reason: Completed
    status: "False"
    type: Progressing
  - lastTransitionTime: "2025-04-01T19:10:35Z"
    message: All clusters are compliant with all the managed policies
    reason: Completed
    status: "True"
    type: Succeeded
  managedPoliciesContent:
    common-subscriptions-policy: '[{"kind":"Subscription","name":"lvms-operator","apiVersion":"operators.coreos.com/v1alpha1","namespace":"openshift-storage"},{"kind":"Subscription","name":"sriov-network-operator-subscription","apiVersion":"operators.coreos.com/v1alpha1","namespace":"openshift-sriov-network-operator"}]'
  managedPoliciesForUpgrade:
  - name: common-config-policy
    namespace: ztp-policies
  - name: common-subscriptions-policy
    namespace: ztp-policies
  - name: du-sno-group-policy
    namespace: ztp-policies
  - name: du-sno-sites-sites-policy
    namespace: ztp-policies
  managedPoliciesNs:
    common-config-policy: ztp-policies
    common-subscriptions-policy: ztp-policies
    du-sno-group-policy: ztp-policies
    du-sno-sites-sites-policy: ztp-policies
  remediationPlan:
  - - sno-abi
  status:
    completedAt: "2025-04-01T19:10:35Z"
    startedAt: "2025-04-01T19:08:48Z"
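
To extract a single condition without scrolling through the whole YAML, a jsonpath query can help. For example, to print the message of the Succeeded condition:

oc --kubeconfig ~/hub-kubeconfig get cgu -n ztp-install sno-abi -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].message}'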

Verify that there are now twice as many policies in the hub cluster. That’s because an enforce copy of each of them was created.

oc --kubeconfig ~/hub-kubeconfig get policies -A
NAMESPACE      NAME                                       REMEDIATION ACTION   COMPLIANCE STATE   AGE
sno-abi        ztp-policies.common-config-policy          inform               Compliant          121m
sno-abi        ztp-policies.common-subscriptions-policy   inform               Compliant          121m
sno-abi        ztp-policies.du-sno-group-policy           inform               Compliant          121m
sno-abi        ztp-policies.du-sno-sites-sites-policy     inform               Compliant          121m
sno-ibi        ztp-policies.common-config-policy          inform               Compliant          97m
sno-ibi        ztp-policies.common-subscriptions-policy   inform               Compliant          97m
sno-ibi        ztp-policies.du-sno-group-policy           inform               Compliant          97m
sno-ibi        ztp-policies.du-sno-sites-sites-policy     inform               Compliant          97m
ztp-policies   common-config-policy                       inform               Compliant          121m
ztp-policies   common-subscriptions-policy                inform               Compliant          121m
ztp-policies   common-test-config-policy                  inform                                  121m
ztp-policies   common-test-subscriptions-policy           inform                                  121m
ztp-policies   du-sno-group-policy                        inform               Compliant          121m
ztp-policies   du-sno-sites-sites-policy                  inform               Compliant          121m
ztp-policies   du-sno-test-group-policy                   inform                                  121m

Each enforce policy is applied one by one. There can be cases where the Cluster violations or the Compliance state is not yet set for the enforced cluster; moving to the next policy takes time, depending on the changes applied to the target cluster.
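
While remediation is in progress, you can filter the policy list to see only the enforce copies, for example:

oc --kubeconfig ~/hub-kubeconfig get policies -A | grep enforce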