Guided Exercise: Assess the Health of an OpenShift Cluster

Jul 12, 2023

Verify the health of an OpenShift cluster by querying the status of its cluster operators, nodes, pods, and systemd services. Also verify cluster events and alerts.

Outcomes

  • View the status and get information about cluster operators.

  • Retrieve information about cluster pods and nodes.

  • Retrieve the status of a node's systemd services.

  • View cluster events and alerts.

  • Retrieve debugging information for the cluster.

As the student user on the workstation machine, use the lab command to prepare your system for this exercise. This command ensures that all resources are available for this exercise.

[student@workstation ~]$ lab start cli-health

Procedure 2.3. Instructions

  1. Retrieve the status and view information about cluster operators.

    1. Log in to the OpenShift cluster as the admin user with the redhatocp password.

       [student@workstation ~]$ oc login -u admin -p redhatocp \
         https://api.ocp4.example.com:6443
       Login successful
       ...output omitted...
      
    2. List the operators that users installed in the OpenShift cluster.

       [student@workstation ~]$ oc get operators
       NAME                                 AGE
       metallb-operator.metallb-system      27d
       odf-lvm-operator.openshift-storage   27d
      
    3. List the cluster operators that are installed by default in the OpenShift cluster.

       [student@workstation ~]$ oc get clusteroperators
       NAME                        VERSION   AVAILABLE   PROGRESSING   DEGRADED   ...
       authentication              4.12.0    True        False         False      ...
       baremetal                   4.12.0    True        False         False      ...
       cloud-controller-manager    4.12.0    True        False         False      ...
       cloud-credential            4.12.0    True        False         False      ...
       cluster-autoscaler          4.12.0    True        False         False      ...
       config-operator             4.12.0    True        False         False      ...
       console                     4.12.0    True        False         False      ...
       control-plane-machine-set   4.12.0    True        False         False      ...
       csi-snapshot-controller     4.12.0    True        False         False      ...
       dns                         4.12.0    True        False         False      ...
       etcd                        4.12.0    True        False         False      ...
       ...output omitted...
      
    4. Use the describe command to view detailed information about the openshift-apiserver cluster operator, such as related objects, events, and version.

       [student@workstation ~]$ oc describe clusteroperators openshift-apiserver
       Name:         openshift-apiserver
       Namespace:
       Labels:       <none>
       Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
                     include.release.openshift.io/self-managed-high-availability: true
                     include.release.openshift.io/single-node-developer: true
       API Version:  config.openshift.io/v1
       Kind:         ClusterOperator
       Metadata:
       ...output omitted...
       Spec:
       Status:
         Conditions:
           Last Transition Time:  2023-02-09T22:41:08Z
           Message:               All is well
           Reason:                AsExpected
           Status:                False
           Type:                  Degraded
       ...output omitted...
         Extension:               <nil>
         Related Objects:
           Group:      operator.openshift.io
           Name:       cluster
           Resource:   openshiftapiservers
           Group:
           Name:       openshift-config
           Resource:   namespaces
           Group:
           Name:       openshift-config-managed
           Resource:   namespaces
           Group:
           Name:       openshift-apiserver-operator
           Resource:   namespaces
           Group:
           Name:       openshift-apiserver
           Resource:   namespaces
       ...output omitted...
         Versions:
           Name:     operator
           Version:  4.12.0
           Name:     openshift-apiserver
           Version:  4.12.0
       Events:       <none>
      

      The Related Objects attribute includes information about the name, resource type, and groups for objects that are related to the operator.

    5. List the pods in the openshift-apiserver-operator namespace. Then, view the detailed status of an openshift-apiserver-operator pod by using the JSON format and the jq command. Your pod names might differ.

       [student@workstation ~]$ oc get pods -n openshift-apiserver-operator
       NAME                                            READY   STATUS    RESTARTS   AGE
       openshift-apiserver-operator-7ddc8958fb-7m2kr   1/1     Running   11         27d
      
       [student@workstation ~]$ oc get pod -n openshift-apiserver-operator \
         openshift-apiserver-operator-7ddc8958fb-7m2kr \
         -o json | jq .status
       {
         "conditions": [
       ...output omitted...
           {
             "lastProbeTime": null,
             "lastTransitionTime": "2023-03-08T15:41:34Z",
             "status": "True",
             "type": "Ready"
           },
       ...output omitted...
         ],
         "containerStatuses": [
           {
       ...output omitted...
             "name": "openshift-apiserver-operator",
             "ready": true,
             "restartCount": 11,
             "started": true,
             "state": {
               "running": {
                 "startedAt": "2023-03-08T15:41:34Z"
               }
             }
           }
         ],
         "hostIP": "192.168.50.10",
         "phase": "Running",
         "podIP": "10.8.0.5",
       ...output omitted...
       }
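
The AVAILABLE, PROGRESSING, and DEGRADED columns in the oc get clusteroperators listing from this step lend themselves to a scripted check. The following is a minimal sketch and not part of the exercise. The unhealthy_operators name is hypothetical, and the function assumes that every row reports a version, so that the three status columns are the third, fourth, and fifth fields.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "oc get clusteroperators" output on stdin and
# prints the name of each operator whose status differs from the healthy
# combination AVAILABLE=True, PROGRESSING=False, DEGRADED=False.
# Assumption: the VERSION column is never empty, so the columns do not shift.
unhealthy_operators() {
  awk 'NR > 1 && NF >= 5 && ($3 != "True" || $4 != "False" || $5 != "False") { print $1 }'
}

# Intended use against a live cluster (not executed here):
#   oc get clusteroperators | unhealthy_operators
```

If the function prints nothing, every listed operator matched the healthy combination that the listing in this step shows.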
      
  2. Retrieve the status, resource consumption, and events of cluster pods.

    1. List the memory and CPU usage of all pods in the cluster. Use the --sum option to print the sum of the resource usage. The resource usage on your system probably differs.

       [student@workstation ~]$ oc adm top pods -A --sum
       NAMESPACE              NAME                                     CPU(cores) MEMORY(bytes)
       metallb-system         controller-5f6dfd8c4f-ddr8v              0m         39Mi
       metallb-system         metallb-operator-controller-manager-...  1m         38Mi
       metallb-system         metallb-operator-webhook-server-...      1m         18Mi
       metallb-system         speaker-2dds4                            10m        94Mi
       ...output omitted...
                                                                       505m       8982Mi
      
    2. List the pods and their labels in the openshift-etcd namespace.

       [student@workstation ~]$ oc get pods -n openshift-etcd --show-labels
       NAME                   READY   STATUS      RESTARTS   AGE   LABELS
       etcd-master01          4/4     Running     40         27d   app=etcd,etcd=true,k8s-app=etcd,revision=3
       installer-2-master01   0/1     Completed   0          27d   app=installer
       installer-3-master01   0/1     Completed   0          27d   app=installer
      
    3. List the resource usage of the containers in the etcd-master01 pod in the openshift-etcd namespace. The resource usage on your system probably differs.

       [student@workstation ~]$ oc adm top pods etcd-master01 \
         -n openshift-etcd --containers
       POD             NAME           CPU(cores)   MEMORY(bytes)
       etcd-master01   POD            0m           0Mi
       etcd-master01   etcd           57m          1096Mi
       etcd-master01   etcd-metrics   7m           20Mi
       etcd-master01   etcd-readyz    4m           40Mi
       etcd-master01   etcdctl        0m           0Mi
      
    4. Display a list of all resources, their status, and their types in the openshift-monitoring namespace.

       [student@workstation ~]$ oc get all -n openshift-monitoring --show-kind
       NAME                                                         READY   STATUS    ...
       pod/alertmanager-main-0                                      6/6     Running   ...
       pod/cluster-monitoring-operator-56b769b58f-dtmqj             2/2     Running   ...
       pod/kube-state-metrics-75455b796c-8q28d                      3/3     Running   ...
       ...output omitted...
       NAME                                            TYPE        CLUSTER-IP       ...
       service/alertmanager-main                       ClusterIP   172.30.85.183    ...
       service/alertmanager-operated                   ClusterIP   None             ...
       service/cluster-monitoring-operator             ClusterIP   None             ...
       service/kube-state-metrics                      ClusterIP   None             ...
       ...output omitted...
      
    5. View the logs of the alertmanager-main-0 pod in the openshift-monitoring namespace. The logs might differ on your system.

       [student@workstation ~]$ oc logs alertmanager-main-0 -n openshift-monitoring
       ...output omitted...
       ts=2023-03-09T14:57:11.850Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
       ts=2023-03-09T14:57:11.850Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
      
    6. Retrieve the events for the openshift-image-registry namespace.

       [student@workstation ~]$ oc get events -n openshift-image-registry
       LAST SEEN   TYPE     REASON             OBJECT                            MESSAGE
       42m         Normal   Scheduled          pod/image-pruner-27972000-dg8qt   Successfully assigned openshift-image-registry/image-pruner-27972000-dg8qt to master01
       42m         Normal   AddedInterface     pod/image-pruner-27972000-dg8qt   Add eth0 [10.8.0.96/23] from ovn-kubernetes
       42m         Normal   Pulled             pod/image-pruner-27972000-dg8qt   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1fc4...685b" already present on machine
       42m         Normal   Created            pod/image-pruner-27972000-dg8qt   Created container image-pruner
       ...output omitted...
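
When a namespace produces many events, a count per type and reason is often easier to scan than the raw list. The following is a minimal sketch and not part of the exercise. The summarize_events name is hypothetical, and the function assumes the default oc get events column layout shown above, where the type and reason are the second and third fields of each row.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "oc get events" output on stdin and prints one
# line per (TYPE, REASON) pair, prefixed with the number of matching events.
# Rows with fewer than four fields are ignored.
summarize_events() {
  awk 'NR > 1 && NF >= 4 { count[$2 " " $3]++ }
       END { for (key in count) print count[key], key }' |
    sort -k1,1nr -k2,3
}

# Intended use against a live cluster (not executed here):
#   oc get events -n openshift-image-registry | summarize_events
```

Any line that contains the Warning type is a natural starting point for a closer look at the affected object.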
      
  3. Retrieve information about cluster nodes.

    1. View the status of the nodes in the cluster.

       [student@workstation ~]$ oc get nodes
       NAME       STATUS   ROLES                         AGE   VERSION
       master01   Ready    control-plane,master,worker   27d   v1.25.4+77bec7a
      
    2. Retrieve the resource consumption of the master01 node. The resource usage on your system probably differs.

       [student@workstation ~]$ oc adm top node
       NAME       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
       master01   781m         10%    11455Mi         60%
      
    3. Use a JSONPath filter to determine the capacity and allocatable CPU for the master01 node. The values might differ on your system.

       [student@workstation ~]$ oc get node master01 -o jsonpath=\
       'Allocatable: {.status.allocatable.cpu}{"\n"}'\
       'Capacity: {.status.capacity.cpu}{"\n"}'
       Allocatable: 7500m
       Capacity: 8
      
    4. Determine the number of allocatable pods for the node.

       [student@workstation ~]$ oc get node master01 -o jsonpath=\
       '{.status.allocatable.pods}{"\n"}'
       250
      
    5. Use the describe command to view the events, resource requests, and resource limits for the node. The output might differ on your system.

       [student@workstation ~]$ oc describe node master01
       ...output omitted...
       Allocated resources:
         (Total limits may be over 100 percent, i.e., overcommitted.)
         Resource           Requests       Limits
         --------           --------       ------
         cpu                3158m (42%)    980m (13%)
         memory             12667Mi (66%)  1250Mi (6%)
         ephemeral-storage  0 (0%)         0 (0%)
         hugepages-1Gi      0 (0%)         0 (0%)
         hugepages-2Mi      0 (0%)         0 (0%)
       Events:
         Type    Reason                     Age                  From     Message
         ----    ------                     ----                 ----     -------
         Normal  Starting                   106m                 kubelet  Starting kubelet.
         Normal  NodeHasSufficientMemory    106m (x9 over 106m)  kubelet  Node master01 status is now: NodeHasSufficientMemory
         Normal  NodeHasNoDiskPressure      106m (x7 over 106m)  kubelet  Node master01 status is now: NodeHasNoDiskPressure
         Normal  NodeHasSufficientPID       106m (x7 over 106m)  kubelet  Node master01 status is now: NodeHasSufficientPID
       ...output omitted...
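
The JSONPath query in this step reports the allocatable CPU in millicores (7500m) and the capacity in whole cores (8). To compare the two values, convert both to millicores: 8 cores is 8000m, so the node holds back 8000m - 7500m = 500m from pod scheduling. The following is a minimal sketch of that arithmetic and not part of the exercise. Both function names are hypothetical, and the sketch assumes that CPU quantities are written either as a plain number of cores or with the m suffix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: converts a CPU quantity such as "8", "0.5", or "7500m"
# to an integer number of millicores.
cpu_to_millicores() {
  local quantity="$1"
  case "$quantity" in
    *m) printf '%s\n' "${quantity%m}" ;;
    *)  awk -v cores="$quantity" 'BEGIN { printf "%.0f\n", cores * 1000 }' ;;
  esac
}

# Hypothetical helper: prints capacity minus allocatable, in millicores.
# Arguments: capacity quantity, then allocatable quantity.
reserved_cpu_millicores() {
  local capacity allocatable
  capacity="$(cpu_to_millicores "$1")"
  allocatable="$(cpu_to_millicores "$2")"
  printf '%s\n' "$(( capacity - allocatable ))"
}

# With the values from this exercise:
#   reserved_cpu_millicores 8 7500m
```

The same conversion explains the percentages in the describe output: the 3158m of CPU requests is about 42% of the 7500m that is allocatable.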
      
  4. Retrieve the logs and status of the systemd services on the master01 node.

    1. Display the node logs, filtered to show only the most recent log entry for the crio service. The logs might differ on your system.

       [student@workstation ~]$ oc adm node-logs master01 -u crio --tail 1
       -- Logs begin at Thu 2023-02-09 21:19:09 UTC, end at Thu 2023-03-09 16:57:00 UTC. --
       Mar 09 02:39:29.158989 master01 crio[3201]: time="2023-03-09 02:39:29.158737393Z" level=info msg="Image status: &ImageStatusResponse
       ...output omitted...
      
    2. Display the two most recent log entries of the kubelet service on the node. The logs might differ on your system.

       [student@workstation ~]$ oc adm node-logs master01 -u kubelet --tail 2
       -- Logs begin at Thu 2023-02-09 21:19:09 UTC, end at Thu 2023-03-09 16:59:16 UTC. --
       Mar 09 02:40:57.466711 master01 systemd[1]: Stopped Kubernetes Kubelet.
       Mar 09 02:40:57.466835 master01 systemd[1]: kubelet.service: Consumed 1h 27min 8.069s CPU time
       -- Logs begin at Thu 2023-02-09 21:19:09 UTC, end at Thu 2023-03-09 16:59:16 UTC. --
       Mar 09 16:58:52.133046 master01 kubenswrapper[3195]: I0309 16:58:52.132866    3195 kubelet_getters.go:182] "Pod status updated" pod="openshift-etcd/etcd-master01" status=Running
       Mar 09 16:58:52.133046 master01 kubenswrapper[3195]: I0309 16:58:52.132882    3195 kubelet_getters.go:182] "Pod status updated" pod="openshift-kube-apiserver/kube-apiserver-master01" status=Running
      
    3. Create a debug session for the node. Then, use the chroot /host command to access the host binaries.

       [student@workstation ~]$ oc debug node/master01
       Temporary namespace openshift-debug-kzz4c is created for debugging node...
       Starting pod/master01-debug ...
       To use host binaries, run `chroot /host`
       Pod IP: 192.168.50.10
       If you don't see a command prompt, try pressing enter.
       sh-4.4# chroot /host
       sh-4.4#
      
    4. Verify the status of the kubelet service.

       sh-4.4# systemctl status kubelet
       ● kubelet.service - Kubernetes Kubelet
          Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
         Drop-In: /etc/systemd/system/kubelet.service.d
                  └─01-kubens.conf, 10-mco-default-madv.conf, 20-logging.conf, 20-nodenet.conf
          Active: active (running) since Thu 2023-03-09 14:54:51 UTC; 2h 8min ago
        Main PID: 3195 (kubelet)
           Tasks: 28 (limit: 127707)
          Memory: 540.7M
             CPU: 18min 32.117s
       ...output omitted...
      

      Press Ctrl+C to quit the command.

    5. Confirm that the crio service is active.

       sh-4.4# systemctl is-active crio
       active
      
    6. Exit the debug pod.

       sh-4.4# exit
       exit
       sh-4.4# exit
       exit
      
       Removing debug pod ...
       Temporary namespace openshift-debug-kzz4c was removed.
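
The interactive checks in this step can also be scripted. The following is a minimal sketch and not part of the exercise. The report_inactive_units name is hypothetical. The function assumes that systemctl is-active prints one state per line, in the order in which the units were requested, and that the informational messages from oc debug do not appear on standard output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: takes unit names as arguments and reads one state per
# line on stdin, in the same order. Prints every unit whose state is not
# "active" and returns a nonzero status if any such unit exists.
report_inactive_units() {
  local unit state failed=0
  for unit in "$@"; do
    # A missing line is reported as "unknown" instead of an empty state.
    IFS= read -r state || [ -n "$state" ] || state="unknown"
    if [ "$state" != "active" ]; then
      printf '%s is %s\n' "$unit" "$state"
      failed=1
    fi
  done
  return "$failed"
}

# Intended use against a live cluster (not executed here):
#   oc debug node/master01 -- chroot /host systemctl is-active crio kubelet |
#     report_inactive_units crio kubelet
```

The return status makes the helper usable in a shell conditional, so that a script can stop as soon as one of the services is not running.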
      
  5. Retrieve debugging information for the cluster.

    1. Retrieve debugging information for the cluster by using the oc adm must-gather command. Specify the /home/student/must-gather directory as the destination directory. This command might take several minutes to complete.

      Then, confirm that the debugging information exists in the destination directory.

       [student@workstation ~]$ oc adm must-gather --dest-dir /home/student/must-gather
       [must-gather      ] OUT Using must-gather plug-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:07d3...e94c
       ...output omitted...
       Reprinting Cluster State:
       When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
       ClusterID: 94ff22c1-88a0-44cf-90f6-0b7b8b545434
       ClusterVersion: Stable at "4.12.0"
       ClusterOperators:
           All healthy and stable
      
       [student@workstation ~]$ ls -la ~/must-gather/
       total 688
       drwxrwxr-x.  3 student student    174 Mar  9 12:23 .
       drwx------. 24 student student   4096 Mar  9 12:22 ..
       -rw-r--r--.  1 student student 691751 Mar  9 12:23 event-filter.html
       drwxrwxrwx. 13 student student   4096 Mar  9 12:23 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-07d3...e94c
       -rw-r--r--.  1 student student    111 Mar  9 12:23 timestamp
      
    2. Generate debugging information for the kube-apiserver cluster operator. Specify the /home/student/inspect directory as the destination directory. Limit the debugging information to the last five minutes.

      Then, confirm that the debugging information exists in the destination directory.

       [student@workstation ~]$ oc adm inspect clusteroperator/kube-apiserver \
         --dest-dir /home/student/inspect --since 5m
       Gathering data for ns/openshift-config...
       Gathering data for ns/openshift-config-managed...
       Gathering data for ns/openshift-kube-apiserver-operator...
       Gathering data for ns/openshift-kube-apiserver...
       Gathering data for ns/metallb-system...
       Gathering data for ns/openshift-monitoring...
       Gathering data for ns/openshift-machine-api...
       Gathering data for ns/openshift-multus...
       Gathering data for ns/openshift-cluster-node-tuning-operator...
       Gathering data for ns/openshift-cluster-storage-operator...
       Wrote inspect data to /home/student/inspect.
       [student@workstation ~]$ ls -la inspect/
       total 208
       drwxrwxr-x.  4 student student     98 Mar  9 12:33 .
       drwx------. 25 student student   4096 Mar  9 12:33 ..
       drwxrwxr-x.  8 student student    185 Mar  9 12:33 cluster-scoped-resources
       -rw-r--r--.  1 student student 198069 Mar  9 12:33 event-filter.html
       drwxrwxr-x. 12 student student   4096 Mar  9 12:33 namespaces
       -rw-r--r--.  1 student student    110 Mar  9 12:33 timestamp
      
    3. Delete the debugging information from your system.

       [student@workstation ~]$ rm -rf must-gather inspect
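
Before the cleanup above, a scripted run of this step could confirm that both destination directories were populated. The following is a minimal sketch and not part of the exercise. The verify_gather_dir name is hypothetical, and the check relies only on the two files that appear in both directory listings in this step, timestamp and event-filter.html. It does not validate their contents.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the given directory exists and contains
# non-empty "timestamp" and "event-filter.html" files, as seen in the
# must-gather and inspect listings in this exercise.
verify_gather_dir() {
  local dir="$1"
  [ -d "$dir" ] && [ -s "$dir/timestamp" ] && [ -s "$dir/event-filter.html" ]
}

# Intended use on the workstation (not executed here):
#   verify_gather_dir /home/student/must-gather &&
#     verify_gather_dir /home/student/inspect &&
#     echo "debugging data present"
```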
      

Finish

On the workstation machine, use the lab command to complete this exercise. This step is important to ensure that resources from previous exercises do not impact upcoming exercises.

[student@workstation ~]$ lab finish cli-health