Standard Operating Procedures
The following guide uses the terminology from the Site Reliability Engineer book. |
The Camel K operator exposes a monitoring endpoint, that publishes metrics indicating the level of service provided to its users. These metrics materialize the Service Level Indicators (SLIs) for the Camel K operator.
Service Level Objectives (SLOs) can be defined based on these SLIs. The default alerts created for the Camel K operator query the SLIs corresponding metrics, and match the SLOs for the Camel K operator, so that they fire up as soon as the level of service is not met, and preemptive measures can be taken before beaching the Service Level Agreement (SLA) for the Camel K operator.
Operator SOPs
The following section lists the Standard Operating Procedures (SOPs), corresponding to the default alerts, created for the Camel K operator. It assumes the operator has been installed according to the installation section from the operator monitoring documentation.
It documents the recommended troubleshooting actions, to be performed when a particular alert fires. It is meant to be a living document, to be improved iteratively over time, as users face problematic situations, and actions to troubleshoot and solve them are perfected.
The commands in the following section rely on the jq tool, to process the output of the kubectl commands. You can refer to the download instructions from the tool website. |
CamelKReconciliationDuration
Description
This alert has severity level of "warning". It’s firing when more than 10% of the reconciliation requests have their duration above 0.5s.
Troubleshooting
-
Check the
rate(camel_k_reconciliation_duration_seconds_bucket{le="0.5"}[5m])
SLI, and identify the resource kinds for which the duration is longer than 0.5s. -
Improve this SOP if there’s anything missing, and contact the team if there are any changes that could make this easier in the future.
CamelKReconciliationFailure
Description
This alert has severity level of "warning". It’s firing when some reconciliation requests have failed.
Troubleshooting
-
Check the
camel_k_reconciliation_duration_seconds_count{result="Errored"}
SLI, and identify thekind
label(s) for which the value is not zero. -
Search the operator logs for errors, e.g.:
$ kubectl logs deployment/camel-k-operator --since=1h \ | jq -R 'fromjson? | select(.level == "error")'
Check the
error
,errorVerbose
andstacktrace
fields. -
Inspect the resources corresponding to the errors, e.g.:
$ kubectl logs deployment/camel-k-operator --since=1h \ | jq -rR 'fromjson? | select(.level == "error") | [{namespace, name, controller}] | unique | .[] | "-n \(.namespace) \(.controller | rtrimstr("-controller"))/\(.name)"' \ | xargs -L1 kubectl describe
Check the resource specification and events.
-
Improve this SOP if there’s anything missing, and contact the team if there are any changes that could make this easier in the future.
CamelKSuccessBuildDuration2m
Description
This alert has severity level of "warning". It’s firing when more than 10% of the successful builds have their duration above 2 min.
Troubleshooting
-
Inspect the successful Builds whose duration is longer than 2 minutes, e.g.:
$ kubectl get builds.camel.apache.org -o json \ | jq -r '.items[] | select(.status.phase == "Succeeded") | select(.status.duration | "01-Jan-1970 \(sub("(?<time>.*)\\..*"; "\(.time)s"))" | strptime("%d-%b-%Y %Mm%Ss")? // strptime("%d-%b-%Y %Ss") | mktime > 120) | "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \ | xargs -L1 kubectl describe
Check the resource specification and events.
-
Improve this SOP if there’s anything missing, and contact the team if there are any changes that could make this easier in the future.
CamelKSuccessBuildDuration5m
Description
This alert has severity level of "critical". It’s firing when more than 1% of the successful builds have their duration above 5 min.
Troubleshooting
-
Inspect the successful Builds whose duration is longer than 5 minutes, e.g.:
$ kubectl get builds.camel.apache.org -o json \ | jq -r '.items[] | select(.status.phase == "Succeeded") | select(.status.duration | "01-Jan-1970 \(sub("(?<time>.*)\\..*"; "\(.time)s"))" | strptime("%d-%b-%Y %Mm%Ss")? // strptime("%d-%b-%Y %Ss") | mktime > 300) | "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \ | xargs -L1 kubectl describe
Check the resource specification and events.
-
Improve this SOP if there’s anything missing, and contact the team if there are any changes that could make this easier in the future.
CamelKBuildError
Description
This alert has severity level of "critical". It’s firing when more than 1% of the builds have errored over at least 10 min.
Troubleshooting
-
Inspect the errored Builds, e.g.:
$ kubectl get builds.camel.apache.org -o json \ | jq -r '.items[] | select(.status.phase == "Error") | "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \ | xargs -L1 kubectl get -o jsonpath='{.metadata.namespace}{"/"}{.metadata.name}{"\nError: "}{.status.error}{"\n"}'
Check the error message.
-
Improve this SOP if there’s anything missing, and contact the team if there are any changes that could make this easier in the future.
CamelKBuildQueueDuration1m
Description
This alert has severity level of "warning". It’s firing when more than 1% of the builds have been queued for more than 1 min.
Troubleshooting
-
Inspect the Builds that have been queued for more than 1 minutes, e.g.:
$ kubectl get builds.camel.apache.org -o json \ | jq -r '.items[] | select( (.status.startedAt | strptime("%Y-%m-%dT%H:%M:%SZ") | mktime) - (.status.failure.recovery.attemptTime? // .metadata.creationTimestamp | strptime("%Y-%m-%dT%H:%M:%SZ") | mktime) > 60) | "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \ | xargs -L1 kubectl describe
Check the resource specification and events.
-
Improve this SOP if there’s anything missing, and contact the team if there are any changes that could make this easier in the future.
CamelKBuildQueueDuration5m
Description
This alert has severity level of "critical". It’s firing when more than 1% of the builds have been queued for more than 5 min.
Troubleshooting
-
Inspect the Builds that have been queued for more than 5 minutes, e.g.:
$ kubectl get builds.camel.apache.org -o json \ | jq -r '.items[] | select( (.status.startedAt | strptime("%Y-%m-%dT%H:%M:%SZ") | mktime) - (.status.failure.recovery.attemptTime? // .metadata.creationTimestamp | strptime("%Y-%m-%dT%H:%M:%SZ") | mktime) > 300) | "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \ | xargs -L1 kubectl describe
Check the resource specification and events.
-
Improve this SOP if there’s anything missing, and contact the team if there are any changes that could make this easier in the future.