Improving Kubernetes-Mixin API Server Rules Consistency

Preamble

A very interesting and widely used project in the Kubernetes space is kubernetes-mixin. The project, written in Jsonnet, provides a well-tested and customisable set of Prometheus rules, alerts, and Grafana dashboards.

An important subset of these rules covers the Kubernetes API server, providing valuable insights such as the server's availability, error budget, workqueue length and more.

I recently made a small contribution to the project to fix an issue I had been seeing for a while. Fixing it forced me to dive deep into the details of the Prometheus rules, and I am writing this to share what I learned.

The Problem

In some very large clusters, I observed that the 30-day availability and error budget occasionally went (slightly) above 100%.

GitHub Issue

https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/975

Root Cause Analysis

The problem happened sporadically, especially on large and heavily used clusters.

The kubernetes-mixin splits the API server availability into three views:

Write availability (verb="write"), i.e. requests whose verb label matches the selector verb=~"POST|PUT|PATCH|DELETE"
Read availability (verb="read"), i.e. requests whose verb label matches the selector verb=~"LIST|GET"
Overall availability (verb="all")
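To make the split concrete, here is a minimal Python sketch of how a request verb maps onto the views. The helper name availability_views is hypothetical, and the sketch assumes that the overall view is simply the union of the read and write selectors, which the list above does not state explicitly. Prometheus regex matchers are fully anchored, which is why the sketch uses re.fullmatch.

```python
import re

# Selectors quoted in the list above. Prometheus anchors regex matchers,
# so re.fullmatch mirrors that behaviour.
WRITE_VERBS = re.compile(r"POST|PUT|PATCH|DELETE")
READ_VERBS = re.compile(r"LIST|GET")


def availability_views(verb):
    """Return the views a request verb counts towards (hypothetical helper)."""
    views = []
    if WRITE_VERBS.fullmatch(verb):
        views.append("write")
    if READ_VERBS.fullmatch(verb):
        views.append("read")
    # Assumption: "all" is the union of the read and write selectors.
    if views:
        views.append("all")
    return views


for verb in ("PATCH", "GET", "WATCH"):
    print(verb, availability_views(verb))
```

Under that assumption, a verb such as WATCH matches neither selector and so does not count towards any of the three views.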

Maths to the Rescue

The problem affected all three availability views. For simplicity, I will focus on the write view, since its expression is the easiest of the three to analyse.

To understand why the 30-day availability was going above 100%, I started by looking at the PromQL expression and analysing it.

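The following is the expression (in Jsonnet) for the write 30-day availability. The %(SLODays)s placeholder is filled in by Jsonnet string formatting; for the 30-day window discussed here it renders as 30d.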

1 - (
  (
    # too slow
    sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb="write"})
    -
    sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb="write",le="1"})
  )
  +
  # errors
  sum by (cluster) (code:apiserver_request_total:increase%(SLODays)s{verb="write",code=~"5.."} or vector(0))
)
/
sum by (cluster) (code:apiserver_request_total:increase%(SLODays)s{verb="write"})

The terms of the expression can be rewritten as follows:

1 - ((all_writes - fast_enough_writes) + write_errors) / write_requests_total
=
1 - (slow_writes + write_errors) / write_requests_total
=
1 - unavailable_writes_percent
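The simplified formula is easy to play with numerically. The following is a small Python sketch, not part of the mixin: the function name write_availability and all the counts are made up for illustration. It only shows the arithmetic: the result can exceed 1 only when the numerator slow_writes + write_errors is negative, and since write_errors cannot be negative, that requires fast_enough_writes to be larger than all_writes.

```python
def write_availability(all_writes, fast_enough_writes, write_errors, write_requests_total):
    """Evaluate 1 - (slow_writes + write_errors) / write_requests_total."""
    slow_writes = all_writes - fast_enough_writes
    unavailable_writes = (slow_writes + write_errors) / write_requests_total
    return 1 - unavailable_writes


# Hypothetical 30-day counts where the two terms are consistent:
# 10,000 slow writes and 500 errors out of 1,000,000 requests.
consistent = write_availability(
    all_writes=1_000_000,
    fast_enough_writes=990_000,
    write_errors=500,
    write_requests_total=1_000_000,
)

# Hypothetical counts where the "fast enough" term is slightly larger
# than the "all writes" term, which makes slow_writes negative.
inconsistent = write_availability(
    all_writes=999_000,
    fast_enough_writes=999_500,
    write_errors=0,
    write_requests_total=1_000_000,
)

print(f"consistent terms:   {consistent:.4%}")
print(f"inconsistent terms: {inconsistent:.4%}")
```

With the first set of numbers the availability is roughly 98.95%. With the second set it lands just above 100%, which is the shape of the symptom described earlier; the sketch says nothing about why the two terms would disagree in a real cluster.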