Repair notes: an etcd breakdown caused by a full node root filesystem

Background

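This happened on one of the Controller nodes in the UAT environment. The node's root filesystem filled up, and because the etcd data was not stored on a dedicated disk, etcd broke down and the node went NotReady.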
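
To check whether a node is exposed to the same problem, it helps to see which filesystem actually holds the etcd data directory. The following is a minimal sketch, not something taken from the incident itself: the helper names are hypothetical, it assumes POSIX `df -P` output, and it assumes `/var/lib/etcd` is the data directory (the path used in the commands in this post).

```shell
#!/usr/bin/env sh
# Hypothetical helpers (not part of the original procedure).

# Print the mount point of the filesystem that holds the given path.
# With `df -P`, the mount point is the 6th column of the second line.
mount_point_of() {
  df -P "$1" | awk 'NR==2 {print $6}'
}

# Print the usage percentage (without the % sign) of that filesystem.
usage_percent_of() {
  df -P "$1" | awk 'NR==2 {gsub("%", "", $5); print $5}'
}

# Succeed when the given path lives on the root filesystem.
shares_root_fs() {
  [ "$(mount_point_of "$1")" = "/" ]
}
```

For example, `shares_root_fs /var/lib/etcd && echo "etcd data is on / ($(usage_percent_of /)% used)"` prints a warning line when the data directory has no dedicated mount.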

[Image: etcd error log]

Repair process

After consulting various online resources, I ended up with the following repair procedure:

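  1. Move the static pod YAML away, which stops the broken etcd pod
  2. Remove the broken etcd instance with etcdctl member remove
  3. Back up the data directory, then clear it
  4. Add a new instance with etcdctl member add, and note the configuration that etcdctl prints
  5. Start an etcd container directly (outside Kubernetes); take the startup parameters from the static pod YAML and from the configuration printed in step 4
  6. Once started, the container syncs its data from the leader; the state can be checked with etcdctl endpoint status -w table
  7. If the sync succeeds, stop the etcd container and put the static pod YAML back into its directory; the cluster is then repaired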
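
Steps 2 and 4 both depend on reading etcdctl output by hand. The sketch below shows one way to script that part; the helper names are hypothetical. It assumes the default ("simple") `etcdctl member list` format, where fields are separated by a comma and a space and the member name is the third field, and it assumes `etcdctl member add` prints its suggested settings as `ETCD_...="value"` lines.

```shell
#!/usr/bin/env sh
# Hypothetical helpers (not part of the original procedure).

# Read `etcdctl member list` output on stdin and print the ID of the
# member whose name equals $1.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name {print $1}'
}

# Read `etcdctl member add` output on stdin, keep only the ETCD_* lines,
# and strip the double quotes around each value (plain KEY=value lines).
member_add_env() {
  grep '^ETCD_' | sed -e 's/="/=/' -e 's/"$//'
}
```

With the ETCDCTL_* variables exported, `issue_etcd_id=$(etcdctl member list | member_id_by_name wcn-gduvm-mwdcm1)` fills in the ID needed for the removal. The output of `member_add_env` can be saved to a file and, assuming your nerdctl version supports `--env-file`, passed to the container instead of typing each `-e` value.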

The specific commands:

# stop the broken etcd pod by moving its static pod manifest out of the manifests directory
mv /etc/kubernetes/manifests/etcd.yaml .

# init etcdctl command envs
export endpoints="https://10.82.69.10:2379,https://10.82.69.11:2379,https://10.82.69.12:2379,https://10.82.69.19:2379,https://10.66.10.83:2379"
export cacert="/etc/kubernetes/pki/etcd/ca.crt"
export cert="/etc/kubernetes/pki/etcd/peer.crt"
export key="/etc/kubernetes/pki/etcd/peer.key"

# sample: e member list -w table
alias e="etcdctl --endpoints $endpoints --cacert $cacert --cert $cert --key $key"

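# or, instead of the alias, export the ETCDCTL_* variables so that plain etcdctl works (this variant uses the /etc/kubernetes/ssl/etcd cert path)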
eval $(kubectl get nodes -owide|grep -E "etcd|control-plane" |awk '{printf "https://"$6":2379,"}'|awk '{gsub(",$","");print "export ETCDCTL_ENDPOINTS=\""$1"\""}') && export ETCDCTL_CACERT=/etc/kubernetes/ssl/etcd/ca.crt && export ETCDCTL_CERT=/etc/kubernetes/ssl/etcd/peer.crt && export ETCDCTL_KEY=/etc/kubernetes/ssl/etcd/peer.key

# remove the broken etcd member (take its ID from `etcdctl member list -w table` and set issue_etcd_id first)
etcdctl member remove $issue_etcd_id

# delete the etcd data (per step 3, back the directory up first if you still need it)
rm -rf /var/lib/etcd/*

# add the member back and note the configuration that etcdctl prints
etcdctl member add wcn-gduvm-mwdcm1 --peer-urls=https://10.82.69.10:2380

# start a temporary etcd container to resync the data
nerdctl run -d --name restore_etcd \
-v /etc/kubernetes/ssl/etcd:/etc/kubernetes/ssl/etcd \
-v /var/lib/etcd:/var/lib/etcd \
--network=host \
-e ETCD_NAME="wcn-gduvm-mwdcm1" \
-e ETCD_INITIAL_CLUSTER="wcn-gduvm-mwdcm2=https://10.82.69.11:2380,wcn-gduvm-mwdcm1=https://10.82.69.10:2380,wcn-gduvm-mwdcm3=https://10.82.69.12:2380" \
-e ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.82.69.10:2380" \
-e ETCD_INITIAL_CLUSTER_STATE="existing" \
--entrypoint=etcd 10.82.49.238/quay.io/coreos/etcd:v3.5.6 \
  --advertise-client-urls=https://10.82.69.10:2379 \
  --auto-compaction-retention=8 \
  --cert-file=/etc/kubernetes/ssl/etcd/server.crt \
  --client-cert-auth=true \
  --data-dir=/var/lib/etcd \
  --election-timeout=5000 \
  --experimental-initial-corrupt-check=true \
  --experimental-watch-progress-notify-interval=5s \
  --heartbeat-interval=250 \
  --key-file=/etc/kubernetes/ssl/etcd/server.key \
  --listen-client-urls=https://127.0.0.1:2379,https://10.82.69.10:2379 \
  --listen-metrics-urls=http://127.0.0.1:2381 \
  --listen-peer-urls=https://10.82.69.10:2380 \
  --metrics=basic \
  --peer-cert-file=/etc/kubernetes/ssl/etcd/peer.crt \
  --peer-client-cert-auth=true \
  --peer-key-file=/etc/kubernetes/ssl/etcd/peer.key \
  --peer-trusted-ca-file=/etc/kubernetes/ssl/etcd/ca.crt \
  --snapshot-count=10000 \
  --trusted-ca-file=/etc/kubernetes/ssl/etcd/ca.crt

# wait until the temporary etcd container is running; once kubectl and etcdctl show that both the node and the member have recovered, stop it
nerdctl stop restore_etcd

# put the static pod manifest back so that kubelet starts the etcd pod again
mv ./etcd.yaml /etc/kubernetes/manifests/etcd.yaml
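
The waiting in the last steps was done by hand. A minimal sketch of scripting it is shown below; `retry_until_ok` is a hypothetical helper, not part of the original procedure, and the attempt count and delay in the usage example are arbitrary.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not part of the original procedure).

# Run a command repeatedly until it succeeds.
# $1 = maximum number of attempts, $2 = seconds to wait between attempts,
# remaining arguments = the command to run.
retry_until_ok() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Skip the sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

With the ETCDCTL_* variables exported, `retry_until_ok 60 5 etcdctl endpoint health && nerdctl stop restore_etcd` would stop the temporary container only after every configured endpoint reports healthy. It is still worth looking at `etcdctl endpoint status -w table` before putting the manifest back.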
Author

Warner Chen

Posted on

2024-06-29

Updated on

2024-06-29
