Troubleshooting Tips
In this section, we provide some useful tips to troubleshoot issues that may arise during the execution of this lab.
Unless specified otherwise, the commands below must be executed as root on the hypervisor host.
Verification of the lab status
Git repository and registry
Both the registry and Git repository are running in the hypervisor host as containers:
podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e7cf765660a3 quay.io/mavazque/registry:2.7.1 /etc/docker/regis... 24 hours ago Up 24 hours ago registry
557b51f975ce quay.io/mavazque/gitea:1.17.3 /bin/s6-svscan /e... 24 hours ago Up 24 hours ago 0.0.0.0:2222->22/tcp, 0.0.0.0:3000->3000/tcp gitea
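If either container is missing or stopped, you can try starting it again by name. Optionally, confirm that the Gitea web UI answers on its published port (the loopback address is an assumption; adjust it if your lab publishes the port elsewhere):
podman start registry gitea
curl -sI http://127.0.0.1:3000/ | head -1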
SNO2 virtual machine
In this lab we are going to provision and configure a SNO cluster named SNO2. Let's double-check that the virtual machine exists and that it is stopped.
kcli list vm
+---------------+--------+----------------+------------------------------------------------------+-------------+---------+
| Name | Status | Ip | Source | Plan | Profile |
+---------------+--------+----------------+------------------------------------------------------+-------------+---------+
| hub-installer | up | 192.168.125.25 | CentOS-Stream-GenericCloud-8-20210603.0.x86_64.qcow2 | hub-cluster | kvirt |
| hub-master0 | up | 192.168.125.20 | | hub | kvirt |
| hub-master1 | up | 192.168.125.21 | | hub | kvirt |
| hub-master2 | up | 192.168.125.22 | | hub | kvirt |
| sno1 | up | 192.168.125.30 | | hub | kvirt |
| sno2 | down | | | hub | kvirt |
+---------------+--------+----------------+------------------------------------------------------+-------------+---------+
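It is expected that sno2 shows as down at this point: it will be powered on remotely through Redfish during provisioning. If the virtual machine is missing or you want more detail about it, you can inspect it with kcli:
kcli info vm sno2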
Hub cluster
Before working with oc commands, you can enable command auto-completion by running:
source <(oc completion bash)
# Make it persistent
oc completion bash >> /etc/bash_completion.d/oc_completion
Check the status of the hub cluster.
export KUBECONFIG=~/hub-kubeconfig
oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.13 True False 22h Cluster version is 4.11.13
oc get nodes
NAME STATUS ROLES AGE VERSION
ocp-master-0 Ready master,worker 23h v1.24.6+5157800
ocp-master-1 Ready master,worker 23h v1.24.6+5157800
ocp-master-2 Ready master,worker 23h v1.24.6+5157800
oc get operators
NAME AGE
advanced-cluster-management.open-cluster-management 22h
multicluster-engine.multicluster-engine 22h
odf-lvm-operator.openshift-storage 22h
openshift-gitops-operator.openshift-operators 22h
topology-aware-lifecycle-manager.openshift-operators 22h
oc get catalogsources -A
NAMESPACE NAME DISPLAY TYPE PUBLISHER AGE
openshift-marketplace redhat-operator-index grpc 23h
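If the catalog source looks unhealthy, you can also verify that its index pod is running in the marketplace namespace:
oc get pods -n openshift-marketplace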
DNS resolution
Verify that the OpenShift API and the apps domain (wildcard) can be resolved.
The dig command is not part of the standard Linux utilities, so you may need to install it; on RHEL-based systems it is part of the bind-utils package.
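For example, to install it on a RHEL-based hypervisor:
dnf install -y bind-utils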
dig +short api.hub.5g-deployment.lab
192.168.125.10
dig +short oauth-openshift.apps.hub.5g-deployment.lab
192.168.125.11
ArgoCD sync not working
It may happen that one or both Argo applications (clusters or policies) keep synchronizing, or that their status is set to OutOfSync. In such cases, we can double-check the following:
First, check the status of all the Argo applications using the oc binary. You can describe the failed application or show its full YAML definition.
oc get applications -A
NAMESPACE NAME SYNC STATUS HEALTH STATUS
openshift-gitops clusters OutOfSync Healthy
openshift-gitops hub-operators-config Synced Healthy
openshift-gitops hub-operators-deployment Synced Healthy
openshift-gitops policies Synced Healthy
openshift-gitops sno1-deployment Synced Healthy
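For example, to inspect the OutOfSync clusters application shown above, you can describe it or dump its full definition:
oc describe application clusters -n openshift-gitops
oc get application clusters -n openshift-gitops -o yaml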
We can also connect to the OpenShift GitOps console and see the error there. In this case, there is a missing value in the SiteConfig because we did not copy and paste it properly.
There are also cases where we need to synchronize the application manually because we modified the manifests, for instance after wrongly copying and pasting the SiteConfig definition as detailed in the previous example. In those cases, we can work around the issue by accessing the GitOps console using the local admin user and password.
When that happens, the console shows a permission denied error.
Execute this command connected to the hub cluster to obtain the local admin password. Then, log out and log back in, using admin as the username and, as the password, the one stored as a secret in the hub cluster:
oc extract secret/openshift-gitops-cluster -n openshift-gitops --to=-
More information about monitoring the status of the deployment can be found in the monitoring section.
SNO2 is down after syncing Argo applications
So, we have synced the clusters and policies applications in Argo successfully, i.e. everything is green. If after 5 minutes we still do not see the SNO2 virtual machine booting, we can make use of the following troubleshooting commands:
First, check if the BMC is ready so the server can be started remotely from the hub using Redfish commands.
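The exact Redfish endpoint is environment-specific; a minimal sketch of the query, where <bmc-address> stands for the host:port portion of the bmcAddress defined in the SiteConfig, looks like this:
curl -sk http://<bmc-address>/redfish/v1/Systems/aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0301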
{
"@odata.type": "#ComputerSystem.v1_1_0.ComputerSystem",
"Id": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0301",
"Name": "sno2",
"UUID": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0301",
"Manufacturer": "Sushy Emulator",
"Status": {
"State": "Enabled",
"Health": "OK",
"HealthRollUp": "OK"
},
"PowerState": "Off",
... REDACTED ...
"@odata.context": "/redfish/v1/$metadata#ComputerSystem.ComputerSystem",
"@odata.id": "/redfish/v1/Systems/aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0301",
"@Redfish.Copyright": "Copyright 2014-2016 Distributed Management Task Force, Inc. (DMTF). For the full DMTF copyright policy, see http://www.dmtf.org/about/policies/copyright."
If the server is up but the installation is not progressing, we may check a couple of things. First, let's verify that the sno2.5g-deployment.lab BareMetalHost CR is created and in provisioned status in the hub cluster.
oc get bmh -A
NAMESPACE NAME STATE CONSUMER ONLINE ERROR AGE
openshift-machine-api hub-ctlplane-0 externally provisioned hub-7tjlv-master-0 true 24h
openshift-machine-api hub-ctlplane-1 externally provisioned hub-7tjlv-master-1 true 24h
openshift-machine-api hub-ctlplane-2 externally provisioned hub-7tjlv-master-2 true 24h
sno1 sno1 provisioned true 24h
sno2 sno2.5g-deployment.lab provisioned true 112s
Verify that the ISO image has been generated without errors by checking the InfraEnv and BareMetalHost custom resources:
oc get infraenv -A
NAMESPACE NAME ISO CREATED AT
sno1 sno1 2023-03-01T10:08:09Z
sno2 sno2 2023-03-02T10:26:06Z
oc get infraenv sno2 -n sno2 -oyaml
apiVersion: agent-install.openshift.io/v1beta1
kind: InfraEnv
metadata:
... REDACTED ...
debugInfo:
eventsURL: https://assisted-service-multicluster-engine.apps.hub.5g-deployment.lab/api/assisted-install/v2/events?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiI2YWEzZGUxNS1hOTQ1LTRjNTgtODljOS02MDBkYzJmNWRmNTkifQ.fG3voLHggbgtCW9fQH1Y2vP5DSCOpo-t2pgDwvEe6Q7nE_Qp9-7BMKudpXiSTTYZCeWVE3s6nsAllP4IkK1ljA&infra_env_id=6aa3de15-a945-4c58-89c9-600dc2f5df59
isoDownloadURL: https://assisted-image-service-multicluster-engine.apps.hub.5g-deployment.lab/images/6aa3de15-a945-4c58-89c9-600dc2f5df59?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiI2YWEzZGUxNS1hOTQ1LTRjNTgtODljOS02MDBkYzJmNWRmNTkifQ.AFhkv4UPH0R0kGpBdZ8cqo8iNSH7z-CRsHsYbwQ6cVzjcnxnDRIEiout29UJOyt-lcCPPsOLW1YPKDh5GJ1Tqg&arch=x86_64&type=minimal-iso&version=4.11
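To rule out problems with the assisted image service, you can also verify that the ISO URL from the status above is reachable, for example with a HEAD request:
curl -kIL "$(oc get infraenv sno2 -n sno2 -o jsonpath='{.status.isoDownloadURL}')"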
oc get bmh sno2.5g-deployment.lab -n sno2 -o yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
... REDACTED ...
image:
format: live-iso
url: https://assisted-image-service-multicluster-engine.apps.hub.5g-deployment.lab/images/6aa3de15-a945-4c58-89c9-600dc2f5df59?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiI2YWEzZGUxNS1hOTQ1LTRjNTgtODljOS02MDBkYzJmNWRmNTkifQ.AFhkv4UPH0R0kGpBdZ8cqo8iNSH7z-CRsHsYbwQ6cVzjcnxnDRIEiout29UJOyt-lcCPPsOLW1YPKDh5GJ1Tqg&arch=x86_64&type=minimal-iso&version=4.11
online: true
More information about monitoring the status of the deployment can be found in the monitoring section.
Policies not showing in the Governance console
In cases where the policies are not shown in the Governance section of the Multicloud console, we first have to check whether the policies Argo application was synced successfully. If not, repeat the steps detailed in the previous section.
Verify that the policies in the hub cluster are similar to the ones shown below. Remember that inform is the correct remediation action.
oc get policies -A
NAMESPACE NAME REMEDIATION ACTION COMPLIANCE STATE AGE
sno2 ztp-policies.common-operator-catalog-411 inform 14m
sno2 ztp-policies.group-du-sno-du-profile-wave1 inform 14m
sno2 ztp-policies.group-du-sno-du-profile-wave10 inform 14m
sno2 ztp-policies.site-sno2-performance-policy inform 14m
sno2 ztp-policies.site-sno2-storage-configuration inform 14m
sno2 ztp-policies.site-sno2-tuned-policy inform 14m
sno2 ztp-policies.zone-europe-storage-operator inform 14m
ztp-policies common-operator-catalog-411 inform 35m
ztp-policies group-du-sno-du-profile-wave1 inform 35m
ztp-policies group-du-sno-du-profile-wave10 inform 35m
ztp-policies site-sno2-performance-policy inform 35m
ztp-policies site-sno2-storage-configuration inform 35m
ztp-policies site-sno2-tuned-policy inform 35m
ztp-policies zone-europe-storage-operator inform 35m
Policies not applied
In such cases, the failure can be due to multiple errors. First, let's check that the policies are shown in the Governance console.
If the policies show a warning message in the Cluster violations section, it is because the SNO2 server is still being provisioned. You can double-check the status of the provisioning in the Infrastructure → Clusters section. Verify that the ztp-running label has not been added yet.
In cases where the Governance console shows policies already assigned to SNO2, we should check the status of the TALM operator. Remember that it is responsible for moving the policies from inform to enforce, so they are eventually applied. Check the status of the cluster-group-upgrades-controller-manager Pod and its logs:
oc get pods -n openshift-operators
NAME READY STATUS RESTARTS AGE
cluster-group-upgrades-controller-manager-b757bcdb9-46xtx 2/2 Running 1 (24h ago) 24h
gitops-operator-controller-manager-cd79b49dc-tvdp6 1/1 Running 0 25h
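To inspect the TALM controller logs, something like the following should work (the manager container name is an assumption based on the usual operator pod layout):
oc logs -n openshift-operators deployment/cluster-group-upgrades-controller-manager -c manager --tail=100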
Next, we can verify that a ClusterGroupUpgrade CR was created automatically by the TALM operator. If it is not created, it means that either the label is not set yet in the cluster or the operator is having issues. In the latter case, check the logs as explained previously.
oc get cgu -A
NAMESPACE NAME UPGRADE STATE AGE
ztp-install sno2 UpgradeNotCompleted 2m27s
Describing the CGU shows a lot of information about the current status of the configuration:
oc get cgu -n ztp-install sno2 -oyaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
... REDACTED ...
status:
computedMaxConcurrency: 1
conditions:
- lastTransitionTime: "2023-03-02T11:16:09Z"
message: The ClusterGroupUpgrade CR has upgrade policies that are still non compliant
reason: UpgradeNotCompleted
status: "False"
type: Ready
copiedPolicies:
- sno2-common-operator-catalog-411-2rtdv
- sno2-group-du-sno-du-profile-wave1-4jrqr
- sno2-group-du-sno-du-profile-wave10-xbbsz
- sno2-site-sno2-performance-policy-gbpns
- sno2-site-sno2-storage-configuration-7wwv4
- sno2-site-sno2-tuned-policy-969jx
- sno2-zone-europe-storage-operator-q2bfh
... REDACTED ...
currentBatch: 1
currentBatchRemediationProgress:
sno2:
policyIndex: 2
state: InProgress
currentBatchStartedAt: "2023-03-02T11:16:09Z"
startedAt: "2023-03-02T11:16:09Z"
Verify that there are now twice as many policies in the hub cluster. That's because an enforce copy of each of them was created.
oc get policies -A
NAMESPACE NAME REMEDIATION ACTION COMPLIANCE STATE AGE
sno2 ztp-install.sno2-common-operator-catalog-411-2rtdv enforce Compliant 6m8s
sno2 ztp-install.sno2-group-du-sno-du-profile-wave1-4jrqr enforce Compliant 5m17s
sno2 ztp-install.sno2-group-du-sno-du-profile-wave10-xbbsz enforce Compliant 2m38s
sno2 ztp-install.sno2-site-sno2-performance-policy-gbpns enforce NonCompliant 68s
sno2 ztp-install.sno2-zone-europe-storage-operator-q2bfh enforce Compliant 4m59s
sno2 ztp-policies.common-operator-catalog-411 inform Compliant 56m
sno2 ztp-policies.group-du-sno-du-profile-wave1 inform Compliant 56m
sno2 ztp-policies.group-du-sno-du-profile-wave10 inform Compliant 56m
sno2 ztp-policies.site-sno2-performance-policy inform NonCompliant 56m
sno2 ztp-policies.site-sno2-storage-configuration inform NonCompliant 56m
sno2 ztp-policies.site-sno2-tuned-policy inform NonCompliant 56m
sno2 ztp-policies.zone-europe-storage-operator inform Compliant 56m
ztp-install sno2-common-operator-catalog-411-2rtdv enforce Compliant 6m8s
ztp-install sno2-group-du-sno-du-profile-wave1-4jrqr enforce Compliant 6m8s
ztp-install sno2-group-du-sno-du-profile-wave10-xbbsz enforce Compliant 6m8s
ztp-install sno2-site-sno2-performance-policy-gbpns enforce NonCompliant 6m8s
ztp-install sno2-site-sno2-storage-configuration-7wwv4 enforce 6m8s
ztp-install sno2-site-sno2-tuned-policy-969jx enforce 6m8s
ztp-install sno2-zone-europe-storage-operator-q2bfh enforce Compliant 6m8s
ztp-policies common-operator-catalog-411 inform Compliant 77m
ztp-policies group-du-sno-du-profile-wave1 inform Compliant 77m
ztp-policies group-du-sno-du-profile-wave10 inform Compliant 77m
ztp-policies site-sno2-performance-policy inform NonCompliant 77m
ztp-policies site-sno2-storage-configuration inform NonCompliant 77m
ztp-policies site-sno2-tuned-policy inform NonCompliant 77m
ztp-policies zone-europe-storage-operator inform Compliant 77m
Each enforce policy is applied one by one. There can be cases where the Cluster violations or the Compliance status is not yet set for the enforced cluster. Moving on to the next policy takes time, depending on the changes applied to the target cluster.
OLM Bug
If the SNO cluster policies are not moving to Compliant after a while, you may be hitting this bug.
You need to check the subscription status on the SNO clusters; in order to do that, you first need to get the kubeconfigs:
oc --kubeconfig ~/5g-deployment-lab/hub-kubeconfig -n sno1 extract secret/sno1-admin-kubeconfig --to=- > ~/5g-deployment-lab/sno1-kubeconfig
oc --kubeconfig ~/5g-deployment-lab/hub-kubeconfig -n sno2 extract secret/sno2-admin-kubeconfig --to=- > ~/5g-deployment-lab/sno2-kubeconfig
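You can quickly verify that the extracted kubeconfigs work, for example:
oc --kubeconfig ~/5g-deployment-lab/sno2-kubeconfig get nodes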
Your subscriptions are probably stuck and showing a message like the one below:
The command below checks sno2; you may want to check the SNO cluster where the policies are stuck.
oc --kubeconfig ~/5g-deployment-lab/sno2-kubeconfig -n openshift-storage get subscriptions.operators.coreos.com odf-lvm-operator -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
creationTimestamp: "2023-06-14T14:35:17Z"
generation: 1
labels:
operators.coreos.com/lvms-operator.openshift-storage: ""
test: test
name: odf-lvm-operator
namespace: openshift-storage
resourceVersion: "70835"
uid: f30a26a2-fdaf-4469-8271-d5d5ac0cb64c
spec:
channel: stable-4.12
installPlanApproval: Manual
name: lvms-operator
source: redhat-operator-index
sourceNamespace: openshift-marketplace
status:
catalogHealth:
- catalogSourceRef:
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
name: redhat-operator-index
namespace: openshift-marketplace
resourceVersion: "70778"
uid: e9ebbd29-f28d-4619-ab71-66bed8e52de2
healthy: true
lastUpdated: "2023-06-14T15:34:53Z"
conditions:
- lastTransitionTime: "2023-06-14T15:34:53Z"
message: all available catalogsources are healthy
reason: AllCatalogSourcesHealthy
status: "False"
type: CatalogSourcesUnhealthy
- message: 'failed to populate resolver cache from source community-operators/openshift-marketplace:
failed to list bundles: rpc error: code = Unavailable desc = connection error:
desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc
on 172.30.0.10:53: no such host"'
reason: ErrorPreventedResolution
status: "True"
type: ResolutionFailed
If that’s the case, you should restart the OLM pods on the affected cluster (adjust the kubeconfig accordingly) to get this fixed:
oc --kubeconfig ~/5g-deployment-lab/sno1-kubeconfig -n openshift-operator-lifecycle-manager delete pods --all
Once OLM is restarted, the subscriptions will move to the desired state.
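To confirm the fix, you can check the subscription state, which should eventually report AtLatestKnown:
oc --kubeconfig ~/5g-deployment-lab/sno2-kubeconfig -n openshift-storage get subscriptions.operators.coreos.com odf-lvm-operator -o jsonpath='{.status.state}{"\n"}'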