Earlier we described the use of the single-node version of VictoriaMetrics. For ingestion rates below one million data points per second, it is recommended to use the single-node version instead of the clustered version. The single-node version scales with the number of CPU cores, RAM, and available storage space, and it is easier to configure and operate than the clustered version, so think twice before choosing the clustered version. Having covered the basic use of the single-node version of VictoriaMetrics above, let's move on to how to use the clustered version.
The main features of the cluster version:
- Supports all features of the single-node version.
- Performance and capacity level scaling.
- Support for multiple independent namespaces for time series data (multi-tenancy).
- Support for data replication (multiple copies of the data).
Components
Earlier we learned about the basic architecture of VM; in clustered mode, the main services include the following:
`vmstorage`
: stores the raw data and returns the queried data over the given time range for the specified tag filters. When the directory pointed to by `-storageDataPath` contains less than `-storage.minFreeDiskSpaceBytes` of free space, the `vmstorage` node automatically switches to read-only mode, and the `vminsert` nodes stop sending data to such nodes and start rerouting it to the remaining `vmstorage` nodes.

`vminsert`
: accepts the ingested data and spreads it across the `vmstorage` nodes based on a consistent hash of the metric name and all of its labels.

`vmselect`
: executes queries by fetching the required data from all the configured `vmstorage` nodes.
Each service can scale independently, and the `vmstorage` nodes neither know about nor communicate with each other and do not share any data. This increases the availability of the cluster and simplifies cluster maintenance and scaling. The minimum cluster must contain the following nodes:
- A single `vmstorage` node with the `-retentionPeriod` and `-storageDataPath` arguments
- A single `vminsert` node with `-storageNode=<vmstorage_host>`
- A single `vmselect` node with `-storageNode=<vmstorage_host>`
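For illustration, launching such a minimal cluster on a single host might look like the following sketch (the host and data path are placeholders; the ports are the components' defaults):

```shell
# vmstorage: stores the data (8482 http, 8400 for vminsert, 8401 for vmselect by default)
./vmstorage -retentionPeriod=1 -storageDataPath=/var/lib/vmstorage

# vminsert: accepts writes and forwards them to vmstorage (listens on 8480)
./vminsert -storageNode=localhost:8400

# vmselect: serves queries by reading from vmstorage (listens on 8481)
./vmselect -storageNode=localhost:8401
```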
However, we recommend running at least two nodes for each service component for high availability, so that the cluster keeps working when a single node is temporarily unavailable and the remaining nodes can handle the increased workload. If you have a larger cluster, you can run multiple smaller `vmstorage` nodes, as this reduces the additional workload on the remaining `vmstorage` nodes when some `vmstorage` nodes are temporarily unavailable.
The individual services can also be configured via environment variables in addition to command-line flags, as follows:
- The `-envflag.enable` flag must be set
- Each `.` in the flag name must be replaced with `_`, for example `-insert.maxQueueDuration <duration>` becomes `insert_maxQueueDuration=<duration>`
- For repeated flags, an alternative syntax can be used to concatenate the different values into one, using `,` as the separator, e.g. `-storageNode <nodeA> -storageNode <nodeB>` becomes `-storageNode=<nodeA>,<nodeB>`
- `-envflag.prefix` can be used to set a prefix for the environment variables, e.g. with `-envflag.prefix=VM_`, the environment variable names must start with `VM_`
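As a sketch, a flag-based invocation and its environment-variable equivalent might look like this (host names are placeholders):

```shell
# Flag-based configuration
./vminsert -storageNode=vmstorage-1:8400 -storageNode=vmstorage-2:8400

# Equivalent environment-variable configuration
export VM_storageNode=vmstorage-1:8400,vmstorage-2:8400
./vminsert -envflag.enable -envflag.prefix=VM_
```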
Multi-tenancy
VM clusters also support multiple independent tenants (also called namespaces), which are identified by `accountID` or `accountID:projectID` and are placed in the request URLs.
- Each `accountID` and `projectID` is identified by an arbitrary 32-bit integer in the range [0 .. 2^32). Additional information about the tenant, such as authentication tokens, tenant names, limits, billing, etc., is stored in a separate relational database. This database must be managed by a separate service sitting in front of the VictoriaMetrics cluster, such as `vmauth` or `vmgateway`.
- Tenants are created automatically when the first data point is written to the given tenant.
- Data for all tenants is evenly distributed across the available `vmstorage` nodes, which ensures an even load across the `vmstorage` nodes when different tenants have different amounts of data and different query loads.
- Database performance and resource usage do not depend on the number of tenants; they depend mainly on the total number of active time series across all tenants. A time series is considered active if it has received at least one sample in the past hour or has been queried in the past hour.
- VictoriaMetrics does not support querying multiple tenants in a single request.
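For example, the tenant appears directly in the insert and select URL paths (the endpoints below follow the cluster URL format; the host names match the deployment later in this article):

```shell
# Write a sample for tenant accountID=42
curl -X POST 'http://vminsert:8480/insert/42/prometheus/api/v1/import/prometheus' -d 'foo{bar="baz"} 123'

# Query the same tenant; accountID:projectID such as 42:7 also works in place of 42
curl 'http://vmselect:8481/select/42/prometheus/api/v1/query?query=foo'
```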
Cluster Scaling
The performance and capacity of a VM cluster can be scaled in two ways.
- By adding more resources (CPU, RAM, disk IO, disk space, network bandwidth) to the existing nodes in the cluster, also called vertical scalability.
- By adding more nodes to the cluster, also called horizontal scalability.
There are some general recommendations for cluster scaling:
- Adding more CPU and memory to the existing `vmselect` nodes can improve the performance of complex queries that process a large number of time series and a large number of raw samples.
- Adding more `vmstorage` nodes increases the number of active time series the cluster can handle, and also improves query performance for time series with a high churn rate. Cluster stability also increases with the number of `vmstorage` nodes: when some `vmstorage` nodes are unavailable, the remaining active `vmstorage` nodes have to handle a smaller additional workload.
- Adding more CPU and memory to the existing `vmstorage` nodes increases the number of active time series the cluster can handle. However, adding more `vmstorage` nodes is preferable to adding more CPU and memory to the existing `vmstorage` nodes, because a larger number of `vmstorage` nodes improves cluster stability and query performance for time series with a high churn rate.
- Adding more `vminsert` nodes increases the maximum data ingestion speed, since the ingested data can be split among more `vminsert` nodes.
- Adding more `vmselect` nodes increases the maximum query speed, since incoming concurrent requests can be split among more `vmselect` nodes.
Cluster Availability
- The HTTP load balancer needs to stop routing requests to unavailable `vminsert` and `vmselect` nodes.
- The cluster remains available as long as at least one `vmstorage` node exists: `vminsert` reroutes incoming data from unavailable `vmstorage` nodes to the healthy `vmstorage` nodes.
- If at least one `vmstorage` node is available, `vmselect` continues to serve partial responses. Pass the `-search.denyPartialResponse` flag to `vmselect`, or the `deny_partial_response=1` query parameter in requests to `vmselect`, if consistency is a higher priority than availability.
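For example, to prefer consistency over availability on a per-request basis (the URL follows the cluster URL format described above):

```shell
# Fail the query instead of returning a partial result when some vmstorage nodes are down
curl 'http://vmselect:8481/select/0/prometheus/api/v1/query?query=up&deny_partial_response=1'
```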
De-duplication
If the `-dedup.minScrapeInterval` command-line flag is set to a duration greater than 0, VictoriaMetrics removes duplicate data points. For example, `-dedup.minScrapeInterval=60s` deduplicates data points on the same time series: if multiple points fall into the same discrete 60-second bucket, only the earliest one is kept. In the case of equal timestamps, an arbitrary data point is retained.
The recommended value for `-dedup.minScrapeInterval` equals the `scrape_interval` in the Prometheus configuration, and it is recommended to use a single `scrape_interval` configuration for all scrape targets.
Deduplication reduces disk space usage if multiple identically configured `vmagent` or Prometheus instances in an HA setup write data to the same VictoriaMetrics instance. These `vmagent` or Prometheus instances must have identical `external_labels` sections in their configurations, so that they write data to the same time series.
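As a sketch, two HA Prometheus replicas writing to the same VictoriaMetrics endpoint would share an identical `external_labels` section, with `-dedup.minScrapeInterval` on the VictoriaMetrics side matching the `scrape_interval` (the label values and URL are illustrative):

```yaml
# prometheus.yml on BOTH HA replicas
global:
  scrape_interval: 30s      # matches -dedup.minScrapeInterval=30s on VictoriaMetrics
  external_labels:
    cluster: prod           # must be identical on every replica
remote_write:
  - url: http://vminsert:8480/insert/0/prometheus/
```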
Capacity Planning
Cluster capacity scales linearly with available resources, and the amount of CPU and memory required per node type depends heavily on the workload - number of active time series, series churn rate, query type, query QPS, etc. It is recommended to deploy a test VictoriaMetrics cluster for your production workload and iteratively tune the resources per node and the number of nodes per node type until the cluster becomes stable. It is also recommended to set up monitoring for the cluster to help identify bottlenecks in the cluster setup.
The required storage space for the configured retention (set via the `-retentionPeriod` command-line flag on `vmstorage`) can be extrapolated from the disk space usage in a test run. For example, if the storage usage after one day of running on a production workload is 10GB, then at least `10GB*100=1TB` of disk space is required for `-retentionPeriod=100d` (a 100-day retention period). Storage usage can be monitored using the official Grafana dashboard for VictoriaMetrics clusters (https://grafana.com/grafana/dashboards/11176).
It is recommended to set aside the following amounts of spare resources:
- 50% of free memory in all node types to reduce the probability of crashing due to OOM during temporary spikes in workload.
- 50% free CPU across all node types to reduce the probability of slowdowns during temporary spikes in workload.
- At least 30% of free storage in the directory pointed to by the `-storageDataPath` command-line flag on the `vmstorage` nodes.
Some capacity planning tips for VictoriaMetrics clusters:
- Replication increases the amount of resources required by the cluster by up to N times, where N is the replication factor.
- The cluster's capacity for active time series can be increased by adding more `vmstorage` nodes and/or by increasing the memory and CPU resources per `vmstorage` node.
- Query latency can be reduced by increasing the number of `vmstorage` nodes and/or by increasing the memory and CPU resources per `vmselect` node.
- The total number of CPU cores required by all the `vminsert` nodes can be calculated from the ingestion rate: `CPUs = ingestion_rate / 100K` (see the worked example after this list).
- The `-rpc.disableCompression` command-line flag on the `vminsert` nodes increases ingestion capacity at the cost of higher network bandwidth usage between `vminsert` and `vmstorage`.
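For example, at a sustained ingestion rate of 500K data points per second, the `vminsert` tier as a whole would need roughly `500K / 100K = 5` CPU cores.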
Replication and data security
By default, VictoriaMetrics relies on the underlying storage pointed to by `-storageDataPath` for data replication.
However, replication can also be enabled explicitly by passing the `-replicationFactor=N` command-line parameter to `vminsert`, which ensures that all the data remains available for querying if up to `N-1` `vmstorage` nodes are unavailable. The cluster must contain at least `2*N-1` `vmstorage` nodes, where `N` is the replication factor, in order to maintain the given replication factor for newly ingested data when `N-1` storage nodes are lost.
For example, when `-replicationFactor=3` is passed to `vminsert`, it replicates all the ingested data across 3 distinct `vmstorage` nodes, so up to 2 `vmstorage` nodes can be lost without losing data. The minimum number of `vmstorage` nodes should be `2*3-1 = 5`, so that when 2 `vmstorage` nodes are lost, the remaining 3 `vmstorage` nodes can handle the newly ingested data.
When replication is enabled, the `-dedup.minScrapeInterval=1ms` command-line flag must be passed to the `vmselect` nodes, so that duplicate replicated data is removed at query time. The optional `-replicationFactor=N` parameter can also be passed to `vmselect` to improve query performance when up to `N-1` `vmstorage` nodes are slow and/or temporarily unavailable, because `vmselect` then does not wait for responses from up to `N-1` `vmstorage` nodes. Note, however, that setting `-replicationFactor` on the `vmselect` nodes may sometimes result in partial responses.
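Putting the replication-related flags together, a sketch with a replication factor of 3 and the minimum of 5 storage nodes (host names are placeholders):

```shell
# vminsert replicates every ingested sample to 3 distinct vmstorage nodes
./vminsert -replicationFactor=3 \
  -storageNode=s1:8400,s2:8400,s3:8400,s4:8400,s5:8400

# vmselect removes the replicated duplicates at query time
./vmselect -dedup.minScrapeInterval=1ms -replicationFactor=3 \
  -storageNode=s1:8401,s2:8401,s3:8401,s4:8401,s5:8401
```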
If data that needs deduplication is pushed to VictoriaMetrics from identically configured `vmagent` or Prometheus instances, `-dedup.minScrapeInterval` must be set to a larger value, as described in the deduplication documentation above.
Please note that replication does not protect against disaster, so it is still recommended to perform regular backups. Replication also increases resource usage - CPU, memory, disk space, network bandwidth - by up to `-replicationFactor` times. Instead, replication can be offloaded to the underlying storage pointed to by `-storageDataPath`, such as Google Compute Engine persistent disks, which protect against data loss and data corruption, provide consistently high performance, and can be resized without downtime. For most use cases, HDD-based persistent disks should be sufficient.
Backups
It is recommended to perform regular backups from instant snapshots to protect against errors such as accidental data deletion. The following steps must be performed for each `vmstorage` node to create a backup:
- An instant snapshot can be created by hitting the `/snapshot/create` HTTP handler, which creates the snapshot and returns its name.
- Use the `vmbackup` component to archive the created snapshot from the `<-storageDataPath>/snapshots/<snapshot_name>` folder. The archiving process does not interfere with `vmstorage` operation, so it can be performed at any suitable time.
- Delete unused snapshots via `/snapshot/delete?snapshot=<snapshot_name>` or `/snapshot/delete_all` to free up the occupied storage space.
- There is no need to synchronize backups across the `vmstorage` nodes.
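A sketch of these steps for a single `vmstorage` node (the node address and the backup bucket are placeholders):

```shell
# 1. Create an instant snapshot; the handler returns the snapshot name
curl http://vmstorage:8482/snapshot/create

# 2. Archive the snapshot with vmbackup (does not interfere with vmstorage)
./vmbackup -storageDataPath=/var/lib/vmstorage \
  -snapshotName=<snapshot_name> \
  -dst=s3://my-backup-bucket/vmstorage-0

# 3. Delete the snapshot afterwards to free up the occupied space
curl 'http://vmstorage:8482/snapshot/delete?snapshot=<snapshot_name>'
```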
To restore from a backup:
- Stop the `vmstorage` node using `kill -INT`.
- Use the `vmrestore` component to restore the data from the backup to the `-storageDataPath` directory.
- Start the `vmstorage` node.
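A corresponding sketch for the restore (the backup location is a placeholder):

```shell
kill -INT <vmstorage_pid>
./vmrestore -src=s3://my-backup-bucket/vmstorage-0 -storageDataPath=/var/lib/vmstorage
./vmstorage -retentionPeriod=1 -storageDataPath=/var/lib/vmstorage
```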
Now that we understand some of the configuration details of VM clustering, let's move on to deploying the VM cluster.
Deployment
If you already know the VM components well, then a one-click installation using Helm Chart is recommended.
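Such a one-click installation might look like the following sketch, assuming the official VictoriaMetrics Helm charts repository and its `victoria-metrics-cluster` chart:

```shell
helm repo add vm https://victoriametrics.github.io/helm-charts/
helm repo update
helm install vmcluster vm/victoria-metrics-cluster
```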
The reason we choose to deploy manually here is to get more details about each component.
Since the vmstorage component is stateful, we deploy it with a StatefulSet, and since this component is also scalable, we start with two replicas. The corresponding resource manifest is shown below.
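The manifest might look like the following sketch (the names, namespace, image tag and PVC size are assumptions for illustration; the ports are the component defaults):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vmstorage
spec:
  clusterIP: None              # headless Service: gives each Pod a stable DNS name
  selector:
    app: vmstorage
  ports:
    - name: http
      port: 8482
    - name: vminsert
      port: 8400
    - name: vmselect
      port: 8401
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vmstorage
spec:
  serviceName: vmstorage
  replicas: 2
  selector:
    matchLabels:
      app: vmstorage
  template:
    metadata:
      labels:
        app: vmstorage
    spec:
      containers:
        - name: vmstorage
          image: victoriametrics/vmstorage:v1.77.0-cluster  # pick a current -cluster tag
          args:
            - "-retentionPeriod=1"         # retention in months; 1 is the default
            - "-storageDataPath=/storage"  # data directory, persisted via the PVC below
          ports:
            - containerPort: 8482
              name: http
            - containerPort: 8400
              name: vminsert
            - containerPort: 8401
              name: vmselect
          volumeMounts:
            - name: storage
              mountPath: /storage
  volumeClaimTemplates:
    - metadata:
        name: storage
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```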
First of all, a headless Service needs to be created, because the other components need to access each specific Pod. In the vmstorage startup parameters, the retention period for the metrics data is specified via the `-retentionPeriod` parameter (1 means one month, which is also the default), and the data storage path is specified via the `-storageDataPath` parameter; remember to persist this directory.
Again, just apply this resource directly.
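For example, assuming the manifest above is saved as `vmstorage.yaml`:

```shell
kubectl apply -f vmstorage.yaml
kubectl get pods -l app=vmstorage   # wait until vmstorage-0 and vmstorage-1 are Running
```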
Next, we can deploy the vmselect component, which is stateless and can be managed directly with a Deployment; the corresponding resource manifest is shown below.
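A sketch of such a Deployment and its Service (the names and image tag are assumptions; 8481 is the default vmselect port):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmselect
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vmselect
  template:
    metadata:
      labels:
        app: vmselect
    spec:
      containers:
        - name: vmselect
          image: victoriametrics/vmselect:v1.77.0-cluster
          args:
            # address every vmstorage Pod via its headless-Service FQDN
            - "-storageNode=vmstorage-0.vmstorage.default.svc.cluster.local:8401"
            - "-storageNode=vmstorage-1.vmstorage.default.svc.cluster.local:8401"
          ports:
            - containerPort: 8481
              name: http
---
apiVersion: v1
kind: Service
metadata:
  name: vmselect
spec:
  selector:
    app: vmselect
  ports:
    - name: http
      port: 8481
```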
The most important part here is to specify all the vmstorage node addresses via the `-storageNode` parameter. Since we deployed vmstorage with a StatefulSet and a headless Service above, each Pod can be accessed directly via its FQDN. Apply the above object directly.
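For example, assuming the manifest is saved as `vmselect.yaml`:

```shell
kubectl apply -f vmselect.yaml
```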
If we want to run queries, we can simply expose the vmselect service publicly and change the Grafana data source address to `http://<select-service>/select/0/prometheus/`.
Then the vminsert component needs to be deployed to receive metric writes. It is also stateless, and again the most important part is to specify all the vmstorage nodes via the `-storageNode` parameter.
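A sketch of the vminsert Deployment and Service (again, the names and image tag are assumptions; 8480 is the default vminsert port):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vminsert
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vminsert
  template:
    metadata:
      labels:
        app: vminsert
    spec:
      containers:
        - name: vminsert
          image: victoriametrics/vminsert:v1.77.0-cluster
          args:
            # the same vmstorage FQDNs as for vmselect, but the vminsert port (8400)
            - "-storageNode=vmstorage-0.vmstorage.default.svc.cluster.local:8400"
            - "-storageNode=vmstorage-1.vmstorage.default.svc.cluster.local:8400"
          ports:
            - containerPort: 8480
              name: http
---
apiVersion: v1
kind: Service
metadata:
  name: vminsert
spec:
  selector:
    app: vminsert
  ports:
    - name: http
      port: 8480
```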
Since it is inherently stateless, the number of replicas can be increased as needed, and an HPA can be configured for automatic scaling. Apply the resource manifest above directly.
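For example, assuming the manifest is saved as `vminsert.yaml` (the HPA command is an optional illustration):

```shell
kubectl apply -f vminsert.yaml
# optionally scale vminsert automatically based on CPU usage
kubectl autoscale deployment vminsert --min=2 --max=5 --cpu-percent=80
```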
After the cluster mode components are deployed, we can point the Prometheus instance we configured earlier at the VM cluster for remote writes by changing the `remote_write` address to `http://vminsert:8480/insert/0/prometheus/`. Note that this path differs from the single-node mode API, as shown below.
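The relevant part of the Prometheus configuration would look like this (tenant `0`, following the multi-tenancy URL format described earlier):

```yaml
remote_write:
  - url: http://vminsert:8480/insert/0/prometheus/
```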
Update the Prometheus configuration and restart Prometheus; the single-node VM instance from earlier can be stopped first.
After the configuration takes effect, data starts being written to vmstorage normally. Checking the vmstorage logs, you can see that a partition was successfully created, which shows that data is now being received.
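The logs can be checked with, for example:

```shell
kubectl logs -f vmstorage-0 | grep partition
```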
Then you can go to Grafana and recheck if the Dashboard is working.
If you need to add a new `vmstorage` node, follow these steps:
- Start the new `vmstorage` node with the same `-retentionPeriod` configuration as the existing nodes in the cluster.
- Restart all the `vmselect` nodes one by one, adding a new `-storageNode` entry containing `<new_vmstorage_host>`.
- Restart all the `vminsert` nodes one by one, adding a new `-storageNode` entry containing `<new_vmstorage_host>`.
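For example, with the manifests above, the argument list on the vminsert nodes would gain one more entry (use port 8401 instead on the vmselect nodes):

```yaml
args:
  - "-storageNode=vmstorage-0.vmstorage.default.svc.cluster.local:8400"
  - "-storageNode=vmstorage-1.vmstorage.default.svc.cluster.local:8400"
  - "-storageNode=vmstorage-2.vmstorage.default.svc.cluster.local:8400"  # the new node
```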