Guided Exercise: Assess the Health of an OpenShift Cluster
Verify the health of an OpenShift cluster by querying the status of its cluster operators, nodes, pods, and systemd services. Also verify cluster events and alerts.
Outcomes
View the status and get information about cluster operators.
Retrieve information about cluster pods and nodes.
Retrieve the status of a node's systemd services.
View cluster events and alerts.
Retrieve debugging information for the cluster.
As the student user on the workstation machine, use the lab command to prepare your system for this exercise. This command ensures that all resources are available for this exercise.
[student@workstation ~]$ lab start cli-health
Procedure 2.3. Instructions
Retrieve the status and view information about cluster operators.
Log in to the OpenShift cluster as the admin user with the redhatocp password.

[student@workstation ~]$ oc login -u admin -p redhatocp \
  https://api.ocp4.example.com:6443
Login successful
...output omitted...
List the operators that users installed in the OpenShift cluster.
[student@workstation ~]$ oc get operators
NAME                                 AGE
metallb-operator.metallb-system      27d
odf-lvm-operator.openshift-storage   27d
List the cluster operators that are installed by default in the OpenShift cluster.
[student@workstation ~]$ oc get clusteroperators
NAME                        VERSION   AVAILABLE   PROGRESSING   DEGRADED   ...
authentication              4.12.0    True        False         False      ...
baremetal                   4.12.0    True        False         False      ...
cloud-controller-manager    4.12.0    True        False         False      ...
cloud-credential            4.12.0    True        False         False      ...
cluster-autoscaler          4.12.0    True        False         False      ...
config-operator             4.12.0    True        False         False      ...
console                     4.12.0    True        False         False      ...
control-plane-machine-set   4.12.0    True        False         False      ...
csi-snapshot-controller     4.12.0    True        False         False      ...
dns                         4.12.0    True        False         False      ...
etcd                        4.12.0    True        False         False      ...
...output omitted...
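Rather than scanning this table by eye, you can filter for unhealthy operators. The following one-liner is a sketch that combines oc with the jq command (which this exercise also uses later) to print only operators that report a Degraded condition; on a healthy cluster, such as the one in this output, it prints nothing.

[student@workstation ~]$ oc get clusteroperators -o json | jq -r '.items[]
  | select(.status.conditions[] | select(.type=="Degraded" and .status=="True"))
  | .metadata.name'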
Use the describe command to view detailed information about the openshift-apiserver cluster operator, such as related objects, events, and version.

[student@workstation ~]$ oc describe clusteroperators openshift-apiserver
Name:         openshift-apiserver
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  ...output omitted...
Spec:
Status:
  Conditions:
    Last Transition Time:  2023-02-09T22:41:08Z
    Message:               All is well
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    ...output omitted...
  Extension:  <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  openshiftapiservers
    Group:
    Name:      openshift-config
    Resource:  namespaces
    Group:
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:
    Name:      openshift-apiserver-operator
    Resource:  namespaces
    Group:
    Name:      openshift-apiserver
    Resource:  namespaces
    ...output omitted...
  Versions:
    Name:     operator
    Version:  4.12.0
    Name:     openshift-apiserver
    Version:  4.12.0
Events:  <none>
The Related Objects attribute includes information about the name, resource type, and group for objects that are related to the operator.

List the pods in the openshift-apiserver-operator namespace. Then, view the detailed status of an openshift-apiserver-operator pod by using the JSON format and the jq command. Your pod names might differ.

[student@workstation ~]$ oc get pods -n openshift-apiserver-operator
NAME                                            READY   STATUS    RESTARTS   AGE
openshift-apiserver-operator-7ddc8958fb-7m2kr   1/1     Running   11         27d
[student@workstation ~]$ oc get pod -n openshift-apiserver-operator \
  openshift-apiserver-operator-7ddc8958fb-7m2kr \
  -o json | jq .status
{
  "conditions": [
    ...output omitted...
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2023-03-08T15:41:34Z",
      "status": "True",
      "type": "Ready"
    },
    ...output omitted...
  ],
  "containerStatuses": [
    {
      ...output omitted...
      "name": "openshift-apiserver-operator",
      "ready": true,
      "restartCount": 11,
      "started": true,
      "state": {
        "running": {
          "startedAt": "2023-03-08T15:41:34Z"
        }
      }
    }
  ],
  "hostIP": "192.168.50.10",
  "phase": "Running",
  "podIP": "10.8.0.5",
  ...output omitted...
}
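If you do not need the full status object, a shorter sketch pulls out only the fields that matter for a quick health check: the container name, readiness, and restart count. The field paths follow the pod status shown above; expect output similar to the following.

[student@workstation ~]$ oc get pods -n openshift-apiserver-operator -o json \
  | jq -r '.items[].status.containerStatuses[]
  | "\(.name) ready=\(.ready) restarts=\(.restartCount)"'
openshift-apiserver-operator ready=true restarts=11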
Retrieve the status, resource consumption, and events of cluster pods.
List the memory and CPU usage of all pods in the cluster. Use the --sum option to print the sum of the resource usage. The resource usage on your system probably differs.

[student@workstation ~]$ oc adm top pods -A --sum
NAMESPACE        NAME                                      CPU(cores)   MEMORY(bytes)
metallb-system   controller-5f6dfd8c4f-ddr8v               0m           39Mi
metallb-system   metallb-operator-controller-manager-...   1m           38Mi
metallb-system   metallb-operator-webhook-server-...       1m           18Mi
metallb-system   speaker-2dds4                             10m          94Mi
...output omitted...
                                                           505m         8982Mi
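To spot the heaviest consumers quickly, recent oc clients also accept a --sort-by option on the top subcommand. This sketch assumes that your client version supports the option; it ranks pods by memory usage and keeps the first few lines.

[student@workstation ~]$ oc adm top pods -A --sort-by=memory | head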
List the pods and their labels in the openshift-etcd namespace.

[student@workstation ~]$ oc get pods -n openshift-etcd --show-labels
NAME                   READY   STATUS      RESTARTS   AGE   LABELS
etcd-master01          4/4     Running     40         27d   app=etcd,etcd=true,k8s-app=etcd,revision=3
installer-2-master01   0/1     Completed   0          27d   app=installer
installer-3-master01   0/1     Completed   0          27d   app=installer
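Labels such as app=etcd make it possible to select only the pods that you care about. A minimal sketch with the -l (--selector) option, using the app=etcd label from the previous output:

[student@workstation ~]$ oc get pods -n openshift-etcd -l app=etcd
NAME            READY   STATUS    RESTARTS   AGE
etcd-master01   4/4     Running   40         27d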
List the resource usage of the containers in the etcd-master01 pod in the openshift-etcd namespace. The resource usage on your system probably differs.

[student@workstation ~]$ oc adm top pods etcd-master01 \
  -n openshift-etcd --containers
POD             NAME           CPU(cores)   MEMORY(bytes)
etcd-master01   POD            0m           0Mi
etcd-master01   etcd           57m          1096Mi
etcd-master01   etcd-metrics   7m           20Mi
etcd-master01   etcd-readyz    4m           40Mi
etcd-master01   etcdctl        0m           0Mi
Display a list of all resources, their status, and their types in the openshift-monitoring namespace.

[student@workstation ~]$ oc get all -n openshift-monitoring --show-kind
NAME                                               READY   STATUS    ...
pod/alertmanager-main-0                            6/6     Running   ...
pod/cluster-monitoring-operator-56b769b58f-dtmqj   2/2     Running   ...
pod/kube-state-metrics-75455b796c-8q28d            3/3     Running   ...
...output omitted...
NAME                                  TYPE        CLUSTER-IP      ...
service/alertmanager-main             ClusterIP   172.30.85.183   ...
service/alertmanager-operated         ClusterIP   None            ...
service/cluster-monitoring-operator   ClusterIP   None            ...
service/kube-state-metrics            ClusterIP   None            ...
...output omitted...
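Because healthy resources dominate this output, it can help to list only the pods that are not running. This sketch uses a field selector on status.phase, which the pod API supports; an empty result means that no pod is in a pending or failed phase (note that pods in the Succeeded phase, shown as Completed, also match this filter).

[student@workstation ~]$ oc get pods -n openshift-monitoring \
  --field-selector status.phase!=Running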
View the logs of the alertmanager-main-0 pod in the openshift-monitoring namespace. The logs might differ on your system.

[student@workstation ~]$ oc logs alertmanager-main-0 -n openshift-monitoring
...output omitted...
ts=2023-03-09T14:57:11.850Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-03-09T14:57:11.850Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
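For a multi-container pod such as alertmanager-main-0 (6/6 in the earlier listing), you can narrow the logs to a single container with the -c option and limit the output with --tail. The alertmanager container name is an assumption here; list the actual container names first if you are unsure.

[student@workstation ~]$ oc get pod alertmanager-main-0 -n openshift-monitoring \
  -o jsonpath='{.spec.containers[*].name}{"\n"}'
[student@workstation ~]$ oc logs alertmanager-main-0 -c alertmanager \
  -n openshift-monitoring --tail 5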
Retrieve the events for the openshift-image-registry namespace.

[student@workstation ~]$ oc get events -n openshift-image-registry
LAST SEEN   TYPE     REASON           OBJECT                           MESSAGE
42m         Normal   Scheduled        pod/image-pruner-27972000-dg8qt  Successfully assigned openshift-image-registry/image-pruner-27972000-dg8qt to master01
42m         Normal   AddedInterface   pod/image-pruner-27972000-dg8qt  Add eth0 [10.8.0.96/23] from ovn-kubernetes
42m         Normal   Pulled           pod/image-pruner-27972000-dg8qt  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1fc4...685b" already present on machine
42m         Normal   Created          pod/image-pruner-27972000-dg8qt  Created container image-pruner
...output omitted...
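Events are not sorted by default. During a health check, a common sketch is to sort them by timestamp and keep only warnings; on this cluster, where the events above are all Normal, the filtered list is empty.

[student@workstation ~]$ oc get events -n openshift-image-registry \
  --sort-by .lastTimestamp --field-selector type=Warning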
Retrieve information about cluster nodes.
View the status of the nodes in the cluster.
[student@workstation ~]$ oc get nodes
NAME       STATUS   ROLES                         AGE   VERSION
master01   Ready    control-plane,master,worker   27d   v1.25.4+77bec7a
Retrieve the resource consumption of the master01 node. The resource usage on your system probably differs.

[student@workstation ~]$ oc adm top node
NAME       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
master01   781m         10%    11455Mi         60%
Use a JSONPath filter to determine the capacity and allocatable CPU for the master01 node. The values might differ on your system.

[student@workstation ~]$ oc get node master01 -o jsonpath=\
'Allocatable: {.status.allocatable.cpu}{"\n"}'\
'Capacity: {.status.capacity.cpu}{"\n"}'
Allocatable: 7500m
Capacity: 8
Determine the number of allocatable pods for the node.
[student@workstation ~]$ oc get node master01 -o jsonpath=\
'{.status.allocatable.pods}{"\n"}'
250
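JSONPath prints one expression at a time. To compare several fields across all nodes at once, the custom-columns output format can be more readable. A minimal sketch using the same allocatable fields as the previous two commands:

[student@workstation ~]$ oc get nodes -o custom-columns=\
NAME:.metadata.name,CPU:.status.allocatable.cpu,PODS:.status.allocatable.pods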
Use the describe command to view the events, resource requests, and resource limits for the node. The output might differ on your system.

[student@workstation ~]$ oc describe node master01
...output omitted...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                3158m (42%)    980m (13%)
  memory             12667Mi (66%)  1250Mi (6%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
Events:
  Type    Reason                   Age                  From     Message
  ----    ------                   ----                 ----     -------
  Normal  Starting                 106m                 kubelet  Starting kubelet.
  Normal  NodeHasSufficientMemory  106m (x9 over 106m)  kubelet  Node master01 status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    106m (x7 over 106m)  kubelet  Node master01 status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     106m (x7 over 106m)  kubelet  Node master01 status is now: NodeHasSufficientPID
...output omitted...
Retrieve the logs and status of the systemd services on the master01 node.

Display the logs of the node. Filter the logs to show the most recent log entry for the crio service. The logs might differ on your system.

[student@workstation ~]$ oc adm node-logs master01 -u crio --tail 1
-- Logs begin at Thu 2023-02-09 21:19:09 UTC, end at Thu 2023-03-09 16:57:00 UTC. --
Mar 09 02:39:29.158989 master01 crio[3201]: time="2023-03-09 02:39:29.158737393Z" level=info msg="Image status: &ImageStatusResponse
...output omitted...
Display the two most recent log entries for the kubelet service on the node. The logs might differ on your system.

[student@workstation ~]$ oc adm node-logs master01 -u kubelet --tail 2
-- Logs begin at Thu 2023-02-09 21:19:09 UTC, end at Thu 2023-03-09 16:59:16 UTC. --
Mar 09 02:40:57.466711 master01 systemd[1]: Stopped Kubernetes Kubelet.
Mar 09 02:40:57.466835 master01 systemd[1]: kubelet.service: Consumed 1h 27min 8.069s CPU time
-- Logs begin at Thu 2023-02-09 21:19:09 UTC, end at Thu 2023-03-09 16:59:16 UTC. --
Mar 09 16:58:52.133046 master01 kubenswrapper[3195]: I0309 16:58:52.132866 3195 kubelet_getters.go:182] "Pod status updated" pod="openshift-etcd/etcd-master01" status=Running
Mar 09 16:58:52.133046 master01 kubenswrapper[3195]: I0309 16:58:52.132882 3195 kubelet_getters.go:182] "Pod status updated" pod="openshift-kube-apiserver/kube-apiserver-master01" status=Running
Create a debug session for the node. Then, use the chroot /host command to access the host binaries.

[student@workstation ~]$ oc debug node/master01
Temporary namespace openshift-debug-kzz4c is created for debugging node...
Starting pod/master01-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.50.10
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4#
Verify the status of the kubelet service.

sh-4.4# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─01-kubens.conf, 10-mco-default-madv.conf, 20-logging.conf, 20-nodenet.conf
   Active: active (running) since Thu 2023-03-09 14:54:51 UTC; 2h 8min ago
 Main PID: 3195 (kubelet)
    Tasks: 28 (limit: 127707)
   Memory: 540.7M
      CPU: 18min 32.117s
...output omitted...
Press Ctrl+C to quit the command.
Confirm that the crio service is active.

sh-4.4# systemctl is-active crio
active
Exit the debug pod.
sh-4.4# exit
exit
sh-4.4# exit
exit

Removing debug pod ...
Temporary namespace openshift-debug-kzz4c was removed.
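An interactive debug session is convenient for exploration, but for a single check you can run the command non-interactively. This sketch chains chroot /host with the systemctl query, and the debug pod is removed automatically when the command finishes.

[student@workstation ~]$ oc debug node/master01 -- \
  chroot /host systemctl is-active kubelet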
Retrieve debugging information for the cluster.
Retrieve debugging information of the cluster by using the oc adm must-gather command. Specify the /home/student/must-gather directory as the destination directory. This command might take several minutes to complete.

Then, confirm that the debugging information exists in the destination directory.

[student@workstation ~]$ oc adm must-gather --dest-dir /home/student/must-gather
[must-gather      ] OUT Using must-gather plug-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:07d3...e94c
...output omitted...
Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 94ff22c1-88a0-44cf-90f6-0b7b8b545434
ClusterVersion: Stable at "4.12.0"
ClusterOperators:
    All healthy and stable

[student@workstation ~]$ ls -la ~/must-gather/
total 688
drwxrwxr-x.  3 student student    174 Mar  9 12:23 .
drwx------. 24 student student   4096 Mar  9 12:22 ..
-rw-r--r--.  1 student student 691751 Mar  9 12:23 event-filter.html
drwxrwxrwx. 13 student student   4096 Mar  9 12:23 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-07d3...e94c
-rw-r--r--.  1 student student    111 Mar  9 12:23 timestamp
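The must-gather command can also run an alternative gather script from the plug-in image instead of the default collection. As an optional sketch, the following invocation runs the audit-log gatherer that ships with the default image; audit collection can produce a large amount of output.

[student@workstation ~]$ oc adm must-gather --dest-dir /home/student/must-gather \
  -- /usr/bin/gather_audit_logs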
Generate debugging information for the kube-apiserver cluster operator. Specify the /home/student/inspect directory as the destination directory. Limit the debugging information to the last five minutes.

Then, confirm that the debugging information exists in the destination directory.

[student@workstation ~]$ oc adm inspect clusteroperator/kube-apiserver \
  --dest-dir /home/student/inspect --since 5m
Gathering data for ns/openshift-config...
Gathering data for ns/openshift-config-managed...
Gathering data for ns/openshift-kube-apiserver-operator...
Gathering data for ns/openshift-kube-apiserver...
Gathering data for ns/metallb-system...
Gathering data for ns/openshift-monitoring...
Gathering data for ns/openshift-machine-api...
Gathering data for ns/openshift-multus...
Gathering data for ns/openshift-cluster-node-tuning-operator...
Gathering data for ns/openshift-cluster-storage-operator...
Wrote inspect data to /home/student/inspect.

[student@workstation ~]$ ls -la inspect/
total 208
drwxrwxr-x.  4 student student     98 Mar  9 12:33 .
drwx------. 25 student student   4096 Mar  9 12:33 ..
drwxrwxr-x.  8 student student    185 Mar  9 12:33 cluster-scoped-resources
-rw-r--r--.  1 student student 198069 Mar  9 12:33 event-filter.html
drwxrwxr-x. 12 student student   4096 Mar  9 12:33 namespaces
-rw-r--r--.  1 student student    110 Mar  9 12:33 timestamp
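The oc adm inspect command is not limited to cluster operators; it accepts other resource types, such as namespaces. A sketch that inspects a single namespace with the same options:

[student@workstation ~]$ oc adm inspect ns/openshift-etcd \
  --dest-dir /home/student/inspect --since 5m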
Delete the debugging information from your system.
[student@workstation ~]$ rm -rf must-gather inspect
Finish
On the workstation machine, use the lab command to complete this exercise. This step is important to ensure that resources from previous exercises do not impact upcoming exercises.
[student@workstation ~]$ lab finish cli-health