Analyzing etcd High Fragmentation Ratio Alerts

The cluster frequently fires the etcdDatabaseHighFragmentationRatio alert. The PrometheusRule is defined as follows:

- alert: etcdDatabaseHighFragmentationRatio
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": database size in use on instance
      {{ $labels.instance }} is {{ $value | humanizePercentage }} of the actual
      allocated disk space, please run defragmentation (e.g. etcdctl defrag) to
      retrieve the unused fragmented disk space.'
    runbook_url: https://etcd.io/docs/v3.5/op-guide/maintenance/#defragmentation
    summary: etcd database size in use is less than 50% of the actual allocated
      storage.
  expr: (last_over_time(etcd_mvcc_db_total_size_in_use_in_bytes{job=~".*etcd.*"}[5m])
    / last_over_time(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[5m])) <
    0.5 and etcd_mvcc_db_total_size_in_use_in_bytes{job=~".*etcd.*"} > 104857600
  for: 10m
  labels:
    severity: warning
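
To make the alert expression concrete, here is a minimal sketch that evaluates the same two conditions for a single pair of samples: the in-use/total ratio must be below 0.5, and the in-use size must exceed 104857600 bytes (100 MiB). The function name `fragmentation_alert` is hypothetical, and the sketch ignores the `for: 10m` clause, which additionally requires the condition to hold for ten minutes before the alert fires.

```shell
# Hypothetical helper: checks the alert expression for one sample pair.
#   $1 = etcd_mvcc_db_total_size_in_use_in_bytes
#   $2 = etcd_mvcc_db_total_size_in_bytes
fragmentation_alert() {
  awk -v in_use="$1" -v total="$2" 'BEGIN {
    # Guard against a zero or missing total to avoid division by zero.
    if (total <= 0) { print "no-match"; exit }
    ratio = in_use / total
    # Both conditions from expr must hold: ratio < 0.5 AND in-use > 100 MiB.
    if (ratio < 0.5 && in_use > 104857600) print "match"
    else print "no-match"
  }'
}
```

For example, `fragmentation_alert 209715200 524288000` (200 MiB in use out of 500 MiB allocated) prints `match`, while 50 MiB in use out of 500 MiB prints `no-match` because the in-use size is under the 100 MiB floor.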

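Related metrics:

  1. etcd_server_quota_backend_bytes: the current backend storage quota in bytes; the default is 2 GB
  2. etcd_mvcc_db_total_size_in_bytes: the total size in bytes of the physically allocated underlying database, including both data (such as the keyspace) and fragmentation; this is the DB SIZE
  3. etcd_mvcc_db_total_size_in_use_in_bytes: the total size in bytes of the underlying database that is logically in use, excluding fragmentation

In other words, setting quota-backend-bytes does not change etcd_mvcc_db_total_size_in_bytes; the metric that changes is etcd_server_quota_backend_bytes. etcd_mvcc_db_total_size_in_bytes reports the DB SIZE, which can also be obtained as follows: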
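
As a rough sketch of how these three metrics could be read without Prometheus, the helper below extracts one metric value from Prometheus text-format output on stdin. The function name `metric_value` is hypothetical, and the endpoint in the usage note is an assumption: where `/metrics` is served, and whether TLS client certificates are required, depends on how etcd was started.

```shell
# Hypothetical helper: print the value of one unlabeled metric from
# Prometheus text-format input on stdin. Comment lines (# HELP, # TYPE)
# are skipped because their first field is "#".
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}
```

Assuming the metrics endpoint is reachable at `http://127.0.0.1:2379/metrics`, one could run `curl -s http://127.0.0.1:2379/metrics | metric_value etcd_mvcc_db_total_size_in_bytes`.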

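# Set ETCD_DATA_DIR to the value of etcd's --data-dir; the db file under member/snap is the on-disk database
ls -lrth "${ETCD_DATA_DIR}/member/snap"
# The DB SIZE column shows the allocated size for each endpoint
etcdctl endpoint status -w table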

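Why does fragmentation occur?

  1. etcd supports multi-version concurrency control (MVCC) and keeps an exact history of its keyspace
  2. Compaction is the only way to discard that history; it is usually automated with --auto-compaction-mode and --auto-compaction-retention
  3. However, the space freed by compaction is not returned to the file system. etcd only marks it as free space available for reuse, so the database file still occupies the same amount of disk after compaction
  4. To actually release the space, the database must be defragmented with etcdctl defrag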
etcdctl defrag
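
The compact-then-defragment sequence described above can be sketched as follows, modeled on the approach in the etcd maintenance guide: read the current revision from `etcdctl endpoint status`, compact up to it, then defragment. The function names are hypothetical, nothing runs until `compact_and_defrag` is called explicitly, and endpoint and TLS flags are omitted on the assumption that they come from the environment. Note that `etcdctl defrag` only acts on the endpoints it is given, and a member blocks reads and writes while it is being defragmented, so it is usually safer to go one member at a time.

```shell
# Extract the current store revision from the JSON printed by
# `etcdctl endpoint status --write-out=json`, read from stdin.
extract_revision() {
  grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -n 1
}

# Hypothetical wrapper: discard history up to the current revision,
# then release the freed pages back to the file system.
compact_and_defrag() {
  local rev
  rev=$(etcdctl endpoint status --write-out=json | extract_revision)
  etcdctl compact "$rev"   # drop all history older than $rev
  etcdctl defrag           # rewrite the db file without the free pages
}
```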

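So when etcdDatabaseHighFragmentationRatio fires, how do you decide whether defragmentation is actually needed?

  1. Check the effective quota with the etcd_server_quota_backend_bytes metric
  2. Check the DB SIZE with the etcd_mvcc_db_total_size_in_bytes metric or the commands above, and see whether it is really large and close to quota-backend-bytes; if it is not, there is no need to worry
  3. If it is, count the objects of each resource type to find out what is making the DB SIZE so large, then defragment to try to release space. If the DB SIZE is still close to quota-backend-bytes after defragmentation, consider raising the quota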
    etcdctl get /registry --prefix --keys-only | grep -v ^$ | awk -F '/'  '{ h[$3]++ } END {for (k in h) print h[k], k}' | sort -nr
    etcdctl defrag
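
The decision steps above can be sketched as a small helper. This is only an illustration under two assumptions that do not come from etcd itself: "close to the quota" is taken to mean at least 80% of it, and the DB SIZE after defragmentation is estimated as roughly equal to the current in-use size. The function name `defrag_advice` is hypothetical.

```shell
# Hypothetical helper: suggest an action from three metric values.
#   $1 = etcd_server_quota_backend_bytes
#   $2 = etcd_mvcc_db_total_size_in_bytes        (DB SIZE)
#   $3 = etcd_mvcc_db_total_size_in_use_in_bytes
# ASSUMPTION: "close to quota" means >= 80% of the quota.
defrag_advice() {
  awk -v quota="$1" -v db="$2" -v in_use="$3" -v near=0.8 'BEGIN {
    limit = near * quota
    if (db < limit)          print "no-action"   # DB SIZE is far from the quota
    else if (in_use < limit) print "defrag"      # defrag should bring it back down
    else                     print "defrag-then-consider-raising-quota"
  }'
}
```

With the default 2 GiB quota, `defrag_advice 2147483648 1992294400 419430400` prints `defrag`, since the database file is near the quota but most of it is free pages.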
Author: Warner Chen

Posted on: 2024-09-14

Updated on: 2024-09-14
