Prometheus is, without a doubt, the most popular cloud-native monitoring tool, and it performs very well. However, as we keep using it, the amount of metric data stored in Prometheus grows over time and queries become more frequent. As we add more Grafana dashboards, we may gradually notice that Grafana can no longer render charts in time and that timeouts occasionally occur, especially when we aggregate a large amount of metrics over a long time range. In those cases Prometheus queries are even more likely to time out, so we need something like background batch processing: the expensive computations are done in the background ahead of time, and users only need to query the precomputed results.
Prometheus provides recording rules to support exactly this kind of background computation; they can optimize the performance of complex PromQL queries and improve query efficiency.
Problems
Let's say we want to know the actual CPU and memory utilization across our Kubernetes nodes. We can compute them from the metrics container_cpu_usage_seconds_total and container_memory_usage_bytes. Since these two metrics are collected for every running container, a slightly larger production environment may easily have thousands of containers running at the same time, and a query that aggregates data for all of those containers over a week at a 5-minute resolution will be very hard for Prometheus to answer quickly.
To get the CPU utilization, we divide the total of container_cpu_usage_seconds_total (as a rate) by the total of kube_node_status_allocatable_cpu_cores. To get the memory utilization, we use a rolling window and divide the total of container_memory_usage_bytes by the total of kube_node_status_allocatable_memory_bytes, as sketched below.
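The raw queries would look roughly like this (the label matchers below are illustrative assumptions and may differ in your cluster):

```
# Cluster-wide CPU utilization: total per-container CPU usage rate divided by
# the total allocatable CPU cores of all nodes (reported by kube-state-metrics)
sum(rate(container_cpu_usage_seconds_total{image!="", container_name!=""}[5m]))
  /
sum(kube_node_status_allocatable_cpu_cores)

# Cluster-wide memory utilization: total container memory usage divided by
# the total allocatable memory of all nodes
sum(container_memory_usage_bytes{image!="", container_name!=""})
  /
sum(kube_node_status_allocatable_memory_bytes)
```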
Recording Rules
As mentioned, Prometheus provides recording rules as a way to optimize our query statements. The basic idea of a recording rule is that it lets us precompute new time series from existing time series. If you use the Prometheus Operator, you will find that a number of such rules are already defined in Prometheus.
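The exact rules differ between versions of kube-prometheus, but they look roughly like this (the memory rule and the group name below are assumptions based on older kube-prometheus releases):

```yaml
groups:
  - name: k8s.rules
    rules:
      # Per-namespace CPU usage rate, precomputed in the background
      - record: namespace:container_cpu_usage_seconds_total:sum_rate
        expr: sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace)
      # Per-namespace memory usage, precomputed in the background
      - record: namespace:container_memory_usage_bytes:sum
        expr: sum(container_memory_usage_bytes{job="kubelet", image!="", container_name!=""}) by (namespace)
```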
These two rules are exactly what we need for the queries above: they are evaluated continuously and store their results in much smaller time series. The expression sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace) is evaluated at a predefined interval and stored as a new metric, namespace:container_cpu_usage_seconds_total:sum_rate, and the memory query is handled in the same way.
Now, I can change the query to derive the CPU utilization as follows.
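As a sketch (the exact expression may differ, but the idea is to query the small precomputed series instead of the raw per-container metric):

```
sum(namespace:container_cpu_usage_seconds_total:sum_rate)
  /
sum(kube_node_status_allocatable_cpu_cores)
```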
Now, it runs 14 times faster!
Memory utilization is calculated in the same way.
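Again as a sketch, assuming the namespace:container_memory_usage_bytes:sum rule shown earlier:

```
sum(namespace:container_memory_usage_bytes:sum)
  /
sum(kube_node_status_allocatable_memory_bytes)
```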
Now it runs 27 times faster!
Recording rule usage
In the Prometheus configuration file, we can specify the paths of recording rule files via rule_files, in much the same way as we define alerting rules.
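For example (the glob path here is just an illustration):

```yaml
# prometheus.yml
rule_files:
  - "rules/*.rules.yaml"
```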
Each rule file is defined by the following format.
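Per the Prometheus documentation, a rule file is simply a list of rule groups:

```yaml
groups:
  [ - <rule_group> ]
```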
A simple rules file might look like this.
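For instance, the minimal example from the Prometheus documentation:

```yaml
groups:
  - name: example
    rules:
      - record: job:http_inprogress_requests:sum
        expr: sum by (job) (http_inprogress_requests)
```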
The specific configuration items for rule_group are shown below.
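This is the format from the Prometheus documentation (newer releases add a few more optional fields):

```yaml
# The name of the group. Must be unique within a file.
name: <string>

# How often rules in the group are evaluated.
# Defaults to the global evaluation interval when not set.
[ interval: <duration> | default = global.evaluation_interval ]

rules:
  [ - <rule> ... ]
```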
As with alerting rules, a group can contain multiple rules.
Based on the rule definitions, Prometheus evaluates the PromQL expression defined in expr in the background and saves the result in the new time series named by record; additional labels can be attached to these samples via labels.
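A single recording rule has the following format (from the Prometheus documentation):

```yaml
# The name of the time series to output to. Must be a valid metric name.
record: <string>

# The PromQL expression to evaluate. On every evaluation cycle this is
# evaluated at the current time, and the result is recorded as a new set
# of time series with the metric name given by 'record'.
expr: <string>

# Labels to add or overwrite before storing the result.
labels:
  [ <labelname>: <labelvalue> ]
```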
By default these rules are evaluated at the same frequency as the alerting rules, which is defined by global.evaluation_interval.
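In prometheus.yml this corresponds to the following setting (format as in the Prometheus documentation):

```yaml
global:
  # How frequently to evaluate rules; a group-level interval overrides this.
  [ evaluation_interval: <duration> | default = 1m ]
```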