Standard Operating Procedures

The following guide uses the terminology from the Site Reliability Engineer book.

The Camel K operator exposes a monitoring endpoint, that publishes metrics indicating the level of service provided to its users. These metrics materialize the Service Level Indicators (SLIs) for the Camel K operator.

Service Level Objectives (SLOs) can be defined based on these SLIs. The default alerts created for the Camel K operator query the SLIs corresponding metrics, and match the SLOs for the Camel K operator, so that they fire up as soon as the level of service is not met, and preemptive measures can be taken before beaching the Service Level Agreement (SLA) for the Camel K operator.

Operator SOPs

The following section lists the Standard Operating Procedures (SOPs), corresponding to the default alerts, created for the Camel K operator. It assumes the operator has been installed according to the installation section from the operator monitoring documentation.

It documents the recommended troubleshooting actions, to be performed when a particular alert fires. It is meant to be a living document, to be improved iteratively over time, as users face problematic situations, and actions to troubleshoot and solve them are perfected.

The commands in the following section rely on the jq tool, to process the output of the kubectl commands. You can refer to the download instructions from the tool website.

CamelKReconciliationDuration

Description

This alert has severity level of "warning". It’s firing when more than 10% of the reconciliation requests have their duration above 0.5s.

Troubleshooting

Check the rate(camel_k_reconciliation_duration_seconds_bucket{le="0.5"}[5m]) SLI, and identify the resource kinds for which the duration is longer than 0.5s.
Improve this SOP if there’s anything missing, and contact the team if there are any changes that could make this easier in the future.

CamelKReconciliationFailure

Description

This alert has severity level of "warning". It’s firing when some reconciliation requests have failed.

Troubleshooting

Check the camel_k_reconciliation_duration_seconds_count{result="Errored"} SLI, and identify the kind label(s) for which the value is not zero.

Search the operator logs for errors, e.g.:

$ kubectl logs deployment/camel-k-operator --since=1h \
| jq -R 'fromjson?
| select(.level == "error")'

Check the error, errorVerbose and stacktrace fields.

Inspect the resources corresponding to the errors, e.g.:

$ kubectl logs deployment/camel-k-operator --since=1h \
| jq -rR 'fromjson?
| select(.level == "error")
| [{namespace, name, controller}]
| unique
| .[]
| "-n \(.namespace) \(.controller | rtrimstr("-controller"))/\(.name)"' \
| xargs -L1 kubectl describe

Check the resource specification and events.

Improve this SOP if there’s anything missing, and contact the team if there are any changes that could make this easier in the future.

CamelKSuccessBuildDuration2m

Description

This alert has severity level of "warning". It’s firing when more than 10% of the successful builds have their duration above 2 min.

Troubleshooting

Inspect the successful Builds whose duration is longer than 2 minutes, e.g.:

$ kubectl get builds.camel.apache.org -o json \
| jq -r '.items[]
| select(.status.phase == "Succeeded")
| select(.status.duration
  | "01-Jan-1970 \(sub("(?<time>.*)\\..*"; "\(.time)s"))" | strptime("%d-%b-%Y %Mm%Ss")? // strptime("%d-%b-%Y %Ss")
  | mktime > 120)
| "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \
| xargs -L1 kubectl describe

Check the resource specification and events.

Improve this SOP if there’s anything missing, and contact the team if there are any changes that could make this easier in the future.

CamelKSuccessBuildDuration5m

Description

This alert has severity level of "critical". It’s firing when more than 1% of the successful builds have their duration above 5 min.

Troubleshooting

Inspect the successful Builds whose duration is longer than 5 minutes, e.g.:

$ kubectl get builds.camel.apache.org -o json \
| jq -r '.items[]
| select(.status.phase == "Succeeded")
| select(.status.duration
  | "01-Jan-1970 \(sub("(?<time>.*)\\..*"; "\(.time)s"))" | strptime("%d-%b-%Y %Mm%Ss")? // strptime("%d-%b-%Y %Ss")
  | mktime > 300)
| "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \
| xargs -L1 kubectl describe

Check the resource specification and events.

Improve this SOP if there’s anything missing, and contact the team if there are any changes that could make this easier in the future.

CamelKBuildError

Description

This alert has severity level of "critical". It’s firing when more than 1% of the builds have errored over at least 10 min.

Troubleshooting

Inspect the errored Builds, e.g.:

$ kubectl get builds.camel.apache.org -o json \
| jq -r '.items[]
| select(.status.phase == "Error")
| "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \
| xargs -L1 kubectl get -o jsonpath='{.metadata.namespace}{"/"}{.metadata.name}{"\nError: "}{.status.error}{"\n"}'

Check the error message.

Improve this SOP if there’s anything missing, and contact the team if there are any changes that could make this easier in the future.

CamelKBuildQueueDuration1m

Description

This alert has severity level of "warning". It’s firing when more than 1% of the builds have been queued for more than 1 min.

Troubleshooting

Inspect the Builds that have been queued for more than 1 minutes, e.g.:

$ kubectl get builds.camel.apache.org -o json \
| jq -r '.items[]
| select(
  (.status.startedAt | strptime("%Y-%m-%dT%H:%M:%SZ") | mktime) -
  (.status.failure.recovery.attemptTime? // .metadata.creationTimestamp | strptime("%Y-%m-%dT%H:%M:%SZ")
  | mktime) > 60)
| "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \
| xargs -L1 kubectl describe

Check the resource specification and events.

Improve this SOP if there’s anything missing, and contact the team if there are any changes that could make this easier in the future.

CamelKBuildQueueDuration5m

Description

This alert has severity level of "critical". It’s firing when more than 1% of the builds have been queued for more than 5 min.

Troubleshooting

Inspect the Builds that have been queued for more than 5 minutes, e.g.:

$ kubectl get builds.camel.apache.org -o json \
| jq -r '.items[]
| select(
  (.status.startedAt | strptime("%Y-%m-%dT%H:%M:%SZ") | mktime) -
  (.status.failure.recovery.attemptTime? // .metadata.creationTimestamp | strptime("%Y-%m-%dT%H:%M:%SZ")
  | mktime) > 300)
| "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \
| xargs -L1 kubectl describe

Check the resource specification and events.

Improve this SOP if there’s anything missing, and contact the team if there are any changes that could make this easier in the future.