A very interesting and widely used project in the Kubernetes space is kubernetes-mixin. The project, written in Jsonnet, provides a well-tested and customisable set of Prometheus rules, alerts, and Grafana dashboards.
One important set of rules covers the Kubernetes API server and provides valuable insights such as the server’s availability, error budget, workqueue length and more.
I recently made a small contribution to that project, solving an issue I had been seeing for a while. Solving the issue forced me to deep-dive into the Prometheus rules details, and I am writing this to share what I learned.
In some very large clusters, I observed that the 30-day availability and error budget occasionally went (slightly) above 100%.
<aside> <img src="/icons/git_orange.svg" alt="/icons/git_orange.svg" width="40px" />
GitHub Issue
The problem happened sporadically, especially on large and heavily used clusters.
The kubernetes-mixin splits the API server availability into three views:
- the write availability (verb="write"), i.e. requests with the verb label matching selector verb=~"POST|PUT|PATCH|DELETE"
- the read availability (verb="read"), i.e. requests with the verb label matching selector verb=~"LIST|GET"
- the overall availability (verb="all")
The problem affected all three views of the availability, but in the following I will refer only to the write one, for the sake of simplicity, since it has the easiest expression of the three to analyse.
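As a rough illustration of the read and write views, the sketch below classifies a request verb using the same regular expressions as the selectors quoted above. The helper name `availability_view` is hypothetical and not part of kubernetes-mixin. It uses `re.fullmatch` because PromQL regex matchers are fully anchored.

```python
import re
from typing import Optional

# Regular expressions copied from the selectors quoted in the text.
WRITE_VERBS = re.compile(r"POST|PUT|PATCH|DELETE")
READ_VERBS = re.compile(r"LIST|GET")


def availability_view(verb: str) -> Optional[str]:
    """Return "write" or "read" for a request verb, or None if neither matches.

    Hypothetical helper, for illustration only.
    """
    # PromQL's =~ matcher is fully anchored, so fullmatch mirrors its behaviour.
    if WRITE_VERBS.fullmatch(verb):
        return "write"
    if READ_VERBS.fullmatch(verb):
        return "read"
    return None


for verb in ("POST", "GET", "WATCH"):
    print(verb, "->", availability_view(verb))
```

A verb such as WATCH matches neither selector, so in this sketch it falls into neither the read nor the write view.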
To understand why the 30d availability was going above 100%, I started by analysing the PromQL expression.
The following is the expression (in Jsonnet) for the write 30-day availability:
1 - (
  (
    # too slow
    sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb="write"})
    -
    sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb="write",le="1"})
  )
  +
  # errors
  sum by (cluster) (code:apiserver_request_total:increase%(SLODays)s{verb="write",code=~"5.."} or vector(0))
)
/
sum by (cluster) (code:apiserver_request_total:increase%(SLODays)s{verb="write"})
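The `%(SLODays)s` fragments are Jsonnet format placeholders that are filled in when the rule is rendered. Jsonnet's `%` operator is modelled on Python's `%` string formatting, so the substitution can be sketched in Python. The value `30d` is an assumption based on the `increase30d` names in the same expression, and the template is only the error term, not the full rule.

```python
# Error term of the availability expression, with its Jsonnet placeholder intact.
template = (
    'sum by (cluster) (code:apiserver_request_total:increase%(SLODays)s'
    '{verb="write",code=~"5.."} or vector(0))'
)

# Assumption: the SLO window is 30 days, matching the increase30d rule names.
rendered = template % {"SLODays": "30d"}

print(rendered)
```

After rendering, the placeholder version and the literal `increase30d` version refer to recording rules with the same naming pattern.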
The terms of the availability expression can be rewritten as follows:
1 - ((all_writes - fast_enough_writes) + write_errors) / write_requests_total
=
1 - (slow_writes + write_errors) / write_requests_total
=
1 - unavailable_writes_percent
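To make the simplified form concrete, here is a minimal sketch that evaluates it on made-up counts. The function name and all numbers are hypothetical: they only illustrate the arithmetic and are not measurements from a real cluster.

```python
def write_availability(all_writes, fast_enough_writes, write_errors, write_requests_total):
    """Evaluate 1 - (slow_writes + write_errors) / write_requests_total."""
    slow_writes = all_writes - fast_enough_writes
    return 1 - (slow_writes + write_errors) / write_requests_total


# Hypothetical 30-day counts: 1,000,000 writes, 2,000 of them slower than 1s, 500 errors.
typical = write_availability(1_000_000, 998_000, 500, 1_000_000)
print(f"{typical:.4%}")

# Hypothetical counts where the le="1" bucket increase exceeds the count increase.
odd = write_availability(1_000_000, 1_000_200, 0, 1_000_000)
print(f"{odd:.4%}")
```

The second call is a purely arithmetic observation: with a positive denominator and a non-negative error count, the result can exceed 100% only if fast_enough_writes is larger than all_writes. On its own it says nothing about why the two counts would diverge.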