RKE1 ETCD 出现 request cluster id mismatch 问题修复记录

当集群中三个 Control Plane 节点的 ETCD 出现 request cluster ID mismatch 问题时,可以保留一个 ETCD 实例通过 --force-new-cluster 参数重建集群,然后再将其他两个节点的 ETCD 实例加入集群。

通过 docker rename 的方式保留第二/三台 Control Plane 节点的 ETCD

1
2
docker stop etcd
docker rename etcd etcd-old

备份第一台 Control Plane 节点的 ETCD 启动命令

1
2
3
4
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock assaflavie/runlike:latest etcd

# 以下为 ETCD 启动命令
docker run --name=etcd --hostname=test001 --env=ETCDCTL_API=3 --env=ETCDCTL_CACERT=/etc/kubernetes/ssl/kube-ca.pem --env=ETCDCTL_CERT=/etc/kubernetes/ssl/kube-etcd-172-16-0-106.pem --env=ETCDCTL_KEY=/etc/kubernetes/ssl/kube-etcd-172-16-0-106-key.pem --env=ETCDCTL_ENDPOINTS=https://127.0.0.1:2379 --env=ETCD_UNSUPPORTED_ARCH=x86_64 --volume=/var/lib/etcd:/var/lib/rancher/etcd/:z --volume=/etc/kubernetes:/etc/kubernetes:z --network=host --restart=always --label='io.rancher.rke.container.name=etcd' --runtime=runc --detach=true registry.cn-hangzhou.aliyuncs.com/rancher/mirrored-coreos-etcd:v3.4.15-rancher1 /usr/local/bin/etcd --listen-peer-urls=https://0.0.0.0:2380 --trusted-ca-file=/etc/kubernetes/ssl/kube-ca.pem --peer-trusted-ca-file=/etc/kubernetes/ssl/kube-ca.pem --key-file=/etc/kubernetes/ssl/kube-etcd-172-16-0-106-key.pem --peer-cert-file=/etc/kubernetes/ssl/kube-etcd-172-16-0-106.pem --peer-key-file=/etc/kubernetes/ssl/kube-etcd-172-16-0-106-key.pem --peer-client-cert-auth=true --initial-advertise-peer-urls=https://172.16.0.106:2380 --heartbeat-interval=500 --cert-file=/etc/kubernetes/ssl/kube-etcd-172-16-0-106.pem --advertise-client-urls=https://172.16.0.106:2379 --cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 --initial-cluster=etcd-rke1-server-0=https://172.16.0.106:2380,etcd-rke1-server-1=https://172.16.0.105:2380,etcd-rke1-server-2=https://172.16.0.104:2380 --initial-cluster-state=new --client-cert-auth=true --listen-client-urls=https://0.0.0.0:2379 --initial-cluster-token=etcd-cluster-1 --name=etcd-rke1-server-0 --enable-v2=true --election-timeout=5000 --data-dir=/var/lib/rancher/etcd/

停止第一台 Control Plane 节点的 ETCD

1
2
docker stop etcd
docker rename etcd etcd-old

修改先前保存的 ETCD 启动命令,在 initial-cluster 参数中删除第二/三台 Control Plane 节点的 ETCD 信息,并在最后添加 --force-new-cluster 参数,然后执行,如果启动后仍然报 request cluster ID mismatch 的错误,可以重复多几次

1
docker run --name=etcd --hostname=test001 --env=ETCDCTL_API=3 --env=ETCDCTL_CACERT=/etc/kubernetes/ssl/kube-ca.pem --env=ETCDCTL_CERT=/etc/kubernetes/ssl/kube-etcd-172-16-0-106.pem --env=ETCDCTL_KEY=/etc/kubernetes/ssl/kube-etcd-172-16-0-106-key.pem --env=ETCDCTL_ENDPOINTS=https://127.0.0.1:2379 --env=ETCD_UNSUPPORTED_ARCH=x86_64 --volume=/var/lib/etcd:/var/lib/rancher/etcd/:z --volume=/etc/kubernetes:/etc/kubernetes:z --network=host --restart=always --label='io.rancher.rke.container.name=etcd' --runtime=runc --detach=true registry.cn-hangzhou.aliyuncs.com/rancher/mirrored-coreos-etcd:v3.4.15-rancher1 /usr/local/bin/etcd --listen-peer-urls=https://0.0.0.0:2380 --trusted-ca-file=/etc/kubernetes/ssl/kube-ca.pem --peer-trusted-ca-file=/etc/kubernetes/ssl/kube-ca.pem --key-file=/etc/kubernetes/ssl/kube-etcd-172-16-0-106-key.pem --peer-cert-file=/etc/kubernetes/ssl/kube-etcd-172-16-0-106.pem --peer-key-file=/etc/kubernetes/ssl/kube-etcd-172-16-0-106-key.pem --peer-client-cert-auth=true --initial-advertise-peer-urls=https://172.16.0.106:2380 --heartbeat-interval=500 --cert-file=/etc/kubernetes/ssl/kube-etcd-172-16-0-106.pem --advertise-client-urls=https://172.16.0.106:2379 --cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 --initial-cluster=etcd-rke1-server-0=https://172.16.0.106:2380 --initial-cluster-state=new --client-cert-auth=true --listen-client-urls=https://0.0.0.0:2379 --initial-cluster-token=etcd-cluster-1 --name=etcd-rke1-server-0 --enable-v2=true --election-timeout=5000 --data-dir=/var/lib/rancher/etcd/ --force-new-cluster

启动完毕后检查 ETCD 集群状态

1
2
docker exec -it -e ETCDCTL_API=3 etcd etcdctl member list -w table
docker exec -it -e ETCDCTL_API=3 etcd etcdctl endpoint status --cluster -w table

在第一台 Control Plane 节点上添加 ETCD Member

1
2
3
4
5
6
7
8
9
MEMBER_IP=172.16.0.105
MEMBER_NAME="rke1-server-1"
docker exec -it etcd etcdctl member add etcd-$MEMBER_NAME --peer-urls=https://$MEMBER_IP:2380

# 执行完命令后,下面的配置需要保留,后续节点启动 ETCD 时需要使用
ETCD_NAME="etcd-rke1-server-1"
ETCD_INITIAL_CLUSTER="etcd-rke1-server-0=https://172.16.0.106:2380,etcd-rke1-server-1=https://172.16.0.105:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://172.16.0.105:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

然后在第二台 Control Plane 节点,进行恢复

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
# 备份数据
mv /var/lib/etcd /var/lib/etcd_bak

# 设置变量
NODE_IP=172.16.0.105
ETCD_IMAGE=registry.cn-hangzhou.aliyuncs.com/rancher/mirrored-coreos-etcd:v3.4.15-rancher1
ETCD_NAME="etcd-rke1-server-1"
ETCD_INITIAL_CLUSTER="etcd-rke1-server-0=https://172.16.0.106:2380,etcd-rke1-server-1=https://172.16.0.105:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://172.16.0.105:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

# 启动 ETCD
docker run --name=etcd --hostname=`hostname` \
--env="ETCDCTL_API=3" \
--env="ETCDCTL_CACERT=/etc/kubernetes/ssl/kube-ca.pem" \
--env="ETCDCTL_CERT=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`.pem" \
--env="ETCDCTL_KEY=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`-key.pem" \
--env="ETCDCTL_ENDPOINTS=https://127.0.0.1:2379" \
--env="ETCD_UNSUPPORTED_ARCH=x86_64" \
--env="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
--volume="/var/lib/etcd:/var/lib/rancher/etcd/:z" \
--volume="/etc/kubernetes:/etc/kubernetes:z" \
--network=host \
--restart=always \
--label io.rancher.rke.container.name="etcd" \
--detach=true \
$ETCD_IMAGE \
/usr/local/bin/etcd \
--peer-client-cert-auth \
--client-cert-auth \
--peer-cert-file=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`.pem \
--peer-key-file=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`-key.pem \
--cert-file=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`.pem \
--trusted-ca-file=/etc/kubernetes/ssl/kube-ca.pem \
--initial-cluster-token=etcd-cluster-1 \
--peer-trusted-ca-file=/etc/kubernetes/ssl/kube-ca.pem \
--key-file=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`-key.pem \
--data-dir=/var/lib/rancher/etcd/ \
--advertise-client-urls=https://$NODE_IP:2379 \
--listen-client-urls=https://0.0.0.0:2379 \
--listen-peer-urls=https://0.0.0.0:2380 \
--initial-advertise-peer-urls=https://$NODE_IP:2380 \
--election-timeout=5000 \
--heartbeat-interval=500 \
--name=$ETCD_NAME \
--initial-cluster=$ETCD_INITIAL_CLUSTER \
--initial-cluster-state=$ETCD_INITIAL_CLUSTER_STATE

启动完后检查状态,如果没问题则可以重复上面步骤添加第三台节点

1
2
docker exec -it -e ETCDCTL_API=3 etcd etcdctl member list -w table
docker exec -it -e ETCDCTL_API=3 etcd etcdctl endpoint status --cluster -w table

集群状态正常后,恢复第一台 Control Plane 节点的 etcd

1
2
3
4
docker stop etcd
docker rename etcd etcd-restore
docker rename etcd-old etcd
docker start etcd
Author

Warner Chen

Posted on

2024-10-24

Updated on

2024-10-25

Licensed under

You need to set install_url to use ShareThis. Please set it in _config.yml.
You forgot to set the business or currency_code for Paypal. Please set it in _config.yml.

Comments

You forgot to set the shortname for Disqus. Please set it in _config.yml.