Lab: Configure Applications for Reliability
Deploy and troubleshoot a reliable application that defines health probes, compute resource requests, and compute resource limits so it can run N instances per node; and configure a horizontal pod autoscaler that will scale to a maximum of N instances.
Outcomes
You should be able to add resource requests to a Deployment object, configure probes, and create a horizontal pod autoscaler resource.
As the student user on the workstation machine, use the lab command to prepare your system for this exercise.
This command ensures that all resources are available for this exercise. It also creates the reliability-review project and deploys the longload application in that project.
[student@workstation ~]$ lab start reliability-review
Procedure 6.6. Instructions
The API URL of your OpenShift cluster is https://api.ocp4.example.com:6443, and the oc command is already installed on your workstation machine.
Log in to the OpenShift cluster as the developer user with the developer password.
Use the reliability-review project for your work.
The longload application in the reliability-review project fails to start. Diagnose and then fix the issue. The application needs 512 MiB of memory to work.
After you fix the issue, you can confirm that the application works by running the ~/DO180/labs/reliability-review/curl_loop.sh script that the lab command prepared. The script sends requests to the application in a loop. For each request, the script displays the pod name and the application status. Press Ctrl+C to quit the script.
Log in to the OpenShift cluster.
[student@workstation ~]$ oc login -u developer -p developer \
    https://api.ocp4.example.com:6443
Login successful.
...output omitted...
Set the reliability-review project as the active project.
[student@workstation ~]$ oc project reliability-review
...output omitted...
List the pods in the project. The pod is in the Pending status. The name of the pod on your system probably differs.
[student@workstation ~]$ oc get pods
NAME                        READY   STATUS    RESTARTS   AGE
longload-64bf8dd776-b6rkz   0/1     Pending   0          8m1s
Retrieve the events for the pod. No compute node has enough memory to accommodate the pod.
[student@workstation ~]$ oc describe pod longload-64bf8dd776-b6rkz
Name:         longload-64bf8dd776-b6rkz
Namespace:    reliability-review
...output omitted...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  8m    default-scheduler  0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
Review the resource requests for memory. The longload deployment requests 8 GiB of memory.
[student@workstation ~]$ oc get deployment longload -o \
    jsonpath='{.spec.template.spec.containers[0].resources.requests.memory}{"\n"}'
8Gi
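The scheduling failure comes down to a size comparison: the request must fit in the memory that a node can still offer. The following is a minimal sketch of that comparison, not the scheduler's actual logic, which also subtracts the requests of pods already placed on the node. The to_mib and fits_on_node helpers are hypothetical names, only the Mi and Gi suffixes are handled, and the 4Gi figure in the usage note is an assumed value because the lab does not state the node capacity.

```shell
#!/usr/bin/env bash
# Convert a Kubernetes memory quantity with a Mi or Gi suffix to MiB.
# Only these two binary suffixes are handled in this sketch.
to_mib() {
  local quantity=$1
  case "$quantity" in
    *Gi) echo $(( ${quantity%Gi} * 1024 )) ;;
    *Mi) echo "${quantity%Mi}" ;;
    *)   echo "unsupported quantity: $quantity" >&2; return 1 ;;
  esac
}

# Hypothetical helper: succeed when REQUEST fits in AVAILABLE memory.
# Usage: fits_on_node REQUEST AVAILABLE
fits_on_node() {
  local request_mib available_mib
  request_mib=$(to_mib "$1") || return 2
  available_mib=$(to_mib "$2") || return 2
  (( request_mib <= available_mib ))
}
```

For example, with an assumed 4Gi of free memory, fits_on_node 8Gi 4Gi fails while fits_on_node 512Mi 4Gi succeeds, which matches the direction of the fix in the next step.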
Set the memory requests to 512 MiB. Ignore the warning message.
[student@workstation ~]$ oc set resources deployment/longload \
    --requests memory=512Mi
Warning: would violate PodSecurity "restricted:v1.24": ...output omitted...
deployment.apps/longload resource requirements updated
Wait for the pod to start. You might have to rerun the command several times for the pod to report a Running status. The name of the pod on your system probably differs.
[student@workstation ~]$ oc get pods
NAME                        READY   STATUS    RESTARTS   AGE
longload-5897c9558f-cx4gt   1/1     Running   0          86s
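Instead of rerunning the command by hand, the wait can be scripted. The following is a minimal sketch: retry_until and pod_is_running are hypothetical helper names that are not part of the lab, and the attempt count and delay in the usage note are arbitrary choices.

```shell
#!/usr/bin/env bash
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry_until() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Hypothetical check: succeeds when at least one pod in the active
# project reports the Running phase.
pod_is_running() {
  oc get pods --field-selector=status.phase=Running -o name | grep -q .
}
```

A call such as retry_until 30 5 pod_is_running polls every five seconds for up to 30 attempts. The oc rollout status deployment/longload command is a built-in alternative that blocks until the rollout completes.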
Run the ~/DO180/labs/reliability-review/curl_loop.sh script to confirm that the application works.
[student@workstation ~]$ ~/DO180/labs/reliability-review/curl_loop.sh
1 curl: (7) Failed to connect to master01.ocp4.example.com port 30372: Connection refused
2 longload-5897c9558f-cx4gt: app is still starting
3 longload-5897c9558f-cx4gt: app is still starting
4 longload-5897c9558f-cx4gt: app is still starting
5 longload-5897c9558f-cx4gt: Ok
6 longload-5897c9558f-cx4gt: Ok
7 longload-5897c9558f-cx4gt: Ok
8 longload-5897c9558f-cx4gt: Ok
...output omitted...
Press Ctrl+C to quit the script.
When the application scales up, your customers complain that some requests fail. To replicate the issue, manually scale up the longload application to three replicas, and run the ~/DO180/labs/reliability-review/curl_loop.sh script at the same time.
The application takes seven seconds to initialize. The application exposes the /health API endpoint on HTTP port 3000. Configure the longload deployment to use this endpoint, to ensure that the application is ready before serving client requests.
Open a new terminal window and run the ~/DO180/labs/reliability-review/curl_loop.sh script.
[student@workstation ~]$ ~/DO180/labs/reliability-review/curl_loop.sh
1 longload-5897c9558f-cx4gt: Ok
2 longload-5897c9558f-cx4gt: Ok
3 longload-5897c9558f-cx4gt: Ok
4 longload-5897c9558f-cx4gt: Ok
...output omitted...
Leave the script running and do not interrupt it.
Scale up the application to three replicas.
[student@workstation ~]$ oc scale deployment/longload --replicas 3
deployment.apps/longload scaled
Watch the output of the curl_loop.sh script in the second terminal. Some requests fail because OpenShift sends requests to the new pods before the application is ready.
...output omitted...
22 longload-5897c9558f-cx4gt: Ok
23 longload-5897c9558f-cx4gt: Ok
24 longload-5897c9558f-cx4gt: Ok
25 curl: (7) Failed to connect to master01.ocp4.example.com port 30372: Connection refused
26 curl: (7) Failed to connect to master01.ocp4.example.com port 30372: Connection refused
27 longload-5897c9558f-cx4gt: Ok
28 curl: (7) Failed to connect to master01.ocp4.example.com port 30372: Connection refused
29 longload-5897c9558f-cx4gt: Ok
30 curl: (7) Failed to connect to master01.ocp4.example.com port 30372: Connection refused
31 longload-5897c9558f-tpssf: app is still starting
32 longload-5897c9558f-kkvm5: app is still starting
33 longload-5897c9558f-cx4gt: Ok
34 longload-5897c9558f-tpssf: app is still starting
35 longload-5897c9558f-tpssf: app is still starting
36 longload-5897c9558f-tpssf: app is still starting
37 longload-5897c9558f-cx4gt: Ok
38 longload-5897c9558f-tpssf: app is still starting
39 longload-5897c9558f-cx4gt: Ok
40 longload-5897c9558f-cx4gt: Ok
...output omitted...
Leave the script running and do not interrupt it.
Add a readiness probe to the longload deployment. Ignore the warning message.
[student@workstation ~]$ oc set probe deployment/longload --readiness \
    --initial-delay-seconds 7 \
    --get-url http://:3000/health
Warning: would violate PodSecurity "restricted:v1.24": ...output omitted...
deployment.apps/longload probes updated
Scale the application back down to one pod.
[student@workstation ~]$ oc scale deployment/longload --replicas 1
deployment.apps/longload scaled
To test your work, scale up the application to three replicas again.
[student@workstation ~]$ oc scale deployment/longload --replicas 3
deployment.apps/longload scaled
Watch the output of the curl_loop.sh script in the second terminal. No request fails.
...output omitted...
92 longload-7ddcc9b7fd-72dtm: Ok
93 longload-7ddcc9b7fd-72dtm: Ok
94 longload-7ddcc9b7fd-72dtm: Ok
95 longload-7ddcc9b7fd-qln95: Ok
96 longload-7ddcc9b7fd-wrxrb: Ok
97 longload-7ddcc9b7fd-qln95: Ok
98 longload-7ddcc9b7fd-wrxrb: Ok
99 longload-7ddcc9b7fd-72dtm: Ok
...output omitted...
Press Ctrl+C to quit the script.
Configure the application so that it automatically scales up when the average memory usage is above 60% of the memory requests value, and scales down when the usage is below this percentage. The minimum number of replicas must be one, and the maximum must be three. The resource that you create for scaling the application must be named longload.
The lab command provides the ~/DO180/labs/reliability-review/hpa.yml resource file as an example. Use the oc explain command to learn the valid parameters for the hpa.spec.metrics.resource.target attribute. Because the file is incomplete, you must update it first if you choose to use it.
To test your work, use the ~/DO180/labs/reliability-review/allocate.sh script that the lab command prepared. This script sends an HTTP request to the application /leak API endpoint. Each request consumes an additional 480 MiB of memory. To free this memory, you can use the ~/DO180/labs/reliability-review/free.sh script.
Before you create the horizontal pod autoscaler resource, scale down the application to one pod.
[student@workstation ~]$ oc scale deployment/longload --replicas 1
deployment.apps/longload scaled
Edit the ~/DO180/labs/reliability-review/hpa.yml resource file. You can retrieve the parameters for the resource attribute by using the oc explain hpa.spec.metrics.resource and oc explain hpa.spec.metrics.resource.target commands.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: longload
  labels:
    app: longload
spec:
  maxReplicas: 3
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: longload
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 60
Use the oc apply command to deploy the horizontal pod autoscaler.
[student@workstation ~]$ oc apply -f ~/DO180/labs/reliability-review/hpa.yml
horizontalpodautoscaler.autoscaling/longload created
In the second terminal, run the watch command to monitor the oc get hpa longload command. Wait for the longload horizontal pod autoscaler to report usage in the TARGETS column. The percentage on your system probably differs.
[student@workstation ~]$ watch oc get hpa longload
Every 2.0s: oc get hpa longload    workstation: Fri Mar 10 05:15:34 2023

NAME       REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
longload   Deployment/longload   13%/60%   1         3         1          75s
Leave the command running and do not interrupt it.
To test your work, run the ~/DO180/labs/reliability-review/allocate.sh script in the first terminal so that the application allocates 480 MiB of memory.
[student@workstation ~]$ ~/DO180/labs/reliability-review/allocate.sh
longload-7ddcc9b7fd-72dtm: consuming memory!
In the second terminal, after two minutes, the oc get hpa longload command shows the memory increase. The horizontal pod autoscaler scales up the application to more than one replica. The percentage on your system probably differs.
Every 2.0s: oc get hpa longload    workstation: Fri Mar 10 05:19:44 2023

NAME       REFERENCE             TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
longload   Deployment/longload   145%/60%   1         3         2          5m18s
Press Ctrl+C to quit the watch command. Close that second terminal when done.
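The values in the TARGETS and REPLICAS columns can be related with some arithmetic. The Kubernetes documentation describes the basic scaling rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The following is a minimal sketch of that rule using integer percentages. The utilization_pct and desired_replicas helpers are hypothetical names, and the sketch deliberately ignores the tolerance band and the stabilization windows that the real controller applies, so it only approximates what the cluster does.

```shell
#!/usr/bin/env bash
# utilization_pct USAGE_MIB REQUEST_MIB
# Memory usage as a whole percentage of the request (rounded down).
utilization_pct() {
  local usage=$1 request=$2
  echo $(( usage * 100 / request ))
}

# desired_replicas CURRENT_REPLICAS CURRENT_PCT TARGET_PCT MIN MAX
# Simplified scaling rule: ceil(current * currentPct / targetPct),
# clamped to the MIN..MAX range. Tolerance and stabilization are ignored.
desired_replicas() {
  local current=$1 utilization=$2 target=$3 min=$4 max=$5
  # Integer ceiling division.
  local desired=$(( (current * utilization + target - 1) / target ))
  if (( desired < min )); then
    desired=$min
  fi
  if (( desired > max )); then
    desired=$max
  fi
  echo "$desired"
}

utilization_pct 480 512         # prints 93
desired_replicas 1 13 60 1 3    # prints 1
desired_replicas 2 145 60 1 3   # prints 3
```

Under these assumptions, a single 480 MiB allocation alone is about 93% of the 512 MiB request, which is well above the 60% target. With two replicas averaging 145%, the rule asks for five replicas, which the maxReplicas value caps at three.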
Evaluation
As the student user on the workstation machine, use the lab command to grade your work. Correct any reported failures and rerun the command until successful.
[student@workstation ~]$ lab grade reliability-review
Finish
As the student user on the workstation machine, use the lab command to complete this exercise. This step is important to ensure that resources from previous exercises do not impact upcoming exercises.
[student@workstation ~]$ lab finish reliability-review