Analysis of an etcd High Fragmentation Ratio Incident

The cluster frequently triggers the etcdDatabaseHighFragmentationRatio alert. The PrometheusRule is as follows:

- alert: etcdDatabaseHighFragmentationRatio
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": database size in use on instance
      {{ $labels.instance }} is {{ $value | humanizePercentage }} of the actual
      allocated disk space, please run defragmentation (e.g. etcdctl defrag) to
      retrieve the unused fragmented disk space.'
    runbook_url: https://etcd.io/docs/v3.5/op-guide/maintenance/#defragmentation
    summary: etcd database size in use is less than 50% of the actual allocated
      storage.
  expr: (last_over_time(etcd_mvcc_db_total_size_in_use_in_bytes{job=~".*etcd.*"}[5m])
    / last_over_time(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[5m])) <
    0.5 and etcd_mvcc_db_total_size_in_use_in_bytes{job=~".*etcd.*"} > 104857600
  for: 10m
  labels:
    severity: warning
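
To make the alert condition concrete, here is a minimal Python sketch of the check that the `expr` performs on a single sample: the in-use size divided by the allocated size must be below 0.5, and the in-use size must exceed 104857600 bytes (100 MiB). The function name `is_fragmented` and the handling of a zero total size are assumptions made for illustration. The sketch does not model `last_over_time(...[5m])` or the `for: 10m` hold period, which Prometheus applies on top of this condition.

```python
# Thresholds copied from the alert expression.
MAX_IN_USE_RATIO = 0.5
MIN_IN_USE_BYTES = 104857600  # 100 MiB


def is_fragmented(in_use_bytes: int, total_bytes: int) -> bool:
    """Return True when the instantaneous alert condition holds.

    in_use_bytes corresponds to etcd_mvcc_db_total_size_in_use_in_bytes,
    total_bytes to etcd_mvcc_db_total_size_in_bytes.
    """
    if total_bytes <= 0:
        # Assumption: treat a missing or zero allocated size as "no alert".
        return False
    ratio = in_use_bytes / total_bytes
    return ratio < MAX_IN_USE_RATIO and in_use_bytes > MIN_IN_USE_BYTES


if __name__ == "__main__":
    mib = 1024 * 1024
    # Ratio 0.2 with 200 MiB in use: both conditions hold.
    print(is_fragmented(200 * mib, 1000 * mib))
    # Ratio 0.05, but only 50 MiB in use: the size guard suppresses the alert.
    print(is_fragmented(50 * mib, 1000 * mib))
    # Ratio 0.6: the database is not fragmented enough to alert.
    print(is_fragmented(600 * mib, 1000 * mib))
```

The second clause of the expression keeps small databases from alerting: a database with little live data can show a low ratio even though defragmentation would reclaim very little space.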

etcd Leader Election

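etcd elects its leader with the Raft algorithm. Raft is a protocol for managing log consistency, and it divides the nodes in the system into three roles:

  1. leader: accepts client requests and sends log replication requests to the followers
  2. follower: receives the log entries replicated by the leader
  3. candidate: the role a node takes while an election is in progress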
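
The three roles can be sketched as a small state machine. The transitions below follow the general Raft design, not anything spelled out in this excerpt, and they are not etcd's actual implementation. The event names (`election_timeout`, `won_majority`, and so on) are hypothetical labels chosen for illustration.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


# (current role, event) -> next role. Event names are hypothetical labels.
TRANSITIONS = {
    # A follower that hears nothing from a leader starts an election.
    (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,
    # A candidate that gathers votes from a majority becomes leader.
    (Role.CANDIDATE, "won_majority"): Role.LEADER,
    # A split vote times out and the candidate starts a new election.
    (Role.CANDIDATE, "election_timeout"): Role.CANDIDATE,
    # A candidate steps down when it learns of a valid leader or a higher term.
    (Role.CANDIDATE, "discovered_leader"): Role.FOLLOWER,
    (Role.CANDIDATE, "saw_higher_term"): Role.FOLLOWER,
    # A leader steps down when it sees a higher term.
    (Role.LEADER, "saw_higher_term"): Role.FOLLOWER,
}


def next_role(role: Role, event: str) -> Role:
    """Return the role after an event; unknown events leave the role unchanged."""
    return TRANSITIONS.get((role, event), role)


if __name__ == "__main__":
    role = Role.FOLLOWER
    for event in ("election_timeout", "won_majority", "saw_higher_term"):
        role = next_role(role, event)
        print(event, "->", role.value)
```

Every node starts as a follower, so the candidate role exists only between an election timeout and the outcome of the vote.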