K8s 集群高可用master节点ETCD挂掉如何恢复?

写在前面


  • 很常见的集群运维场景,整理分享
  • 博文内容为 K8s 集群高可用 master 节点故障如何恢复的过程
  • 理解不足小伙伴帮忙指正

不必太纠结于当下,也不必太忧虑未来,当你经历过一些事情的时候,眼前的风景已经和从前不一样了。——村上春树


遇到了什么问题

今天做实验发现 ,集群其中一个 master 节点上的 etcdapiserver 都挂掉了

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS   ROLES           AGE    VERSION
vms100.liruilongs.github.io   Ready    control-plane   415d   v1.25.1
vms101.liruilongs.github.io   Ready    control-plane   415d   v1.25.1
vms102.liruilongs.github.io   Ready    control-plane   415d   v1.25.1
vms103.liruilongs.github.io   Ready    <none>          415d   v1.25.1
vms105.liruilongs.github.io   Ready    <none>          415d   v1.25.1
vms106.liruilongs.github.io   Ready    <none>          415d   v1.25.1
┌──[root@vms100.liruilongs.github.io]-[~]
└─$

vms100.liruilongs.github.io 这个节点 上的 apiserveretcd

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep apiserver
kube-system          kube-apiserver-vms100.liruilongs.github.io            0/1     CrashLoopBackOff   1448 (3m23s ago)   415d   192.168.26.100   vms100.liruilongs.github.io   <none>           <none>
kube-system          kube-apiserver-vms101.liruilongs.github.io            1/1     Running            272 (3h18m ago)    415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          kube-apiserver-vms102.liruilongs.github.io            1/1     Running            246 (3h18m ago)    415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep etcd
kube-system          etcd-vms100.liruilongs.github.io                      0/1     CrashLoopBackOff   1244 (3m6s ago)    415d   192.168.26.100   vms100.liruilongs.github.io   <none>           <none>
kube-system          etcd-vms101.liruilongs.github.io                      1/1     Running            167 (3h18m ago)    415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          etcd-vms102.liruilongs.github.io                      1/1     Running            173 (3h18m ago)    415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>

查看 keepalived 对应的静态Pod运行正常

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep keep
kube-system          keepalived-vms100.liruilongs.github.io                1/1     Running            63 (3h50m ago)    415d   192.168.26.100   vms100.liruilongs.github.io   <none>           <none>
kube-system          keepalived-vms101.liruilongs.github.io                1/1     Running            54 (3h51m ago)    415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          keepalived-vms102.liruilongs.github.io                1/1     Running            60 (3h51m ago)    415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$

所以可能是 etcd 数据不同步,或者什么原因 导致etcd 挂掉了。因为 每个 master 节点的 apiserver 只和 本节点的 etcd 进行 通信(每个 etcd 的写请求会转发到 etcd 的领导节点),etcd 挂掉,apiserver 无法提供能力,所以也会挂掉。

通过 etcdctl 可以发现 vms100.liruilongs.github.io 上的 etcd 彻底死掉了

┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379 \--cert="/etc/kubernetes/pki/etcd/server.crt" \--key="/etc/kubernetes/pki/etcd/server.key"  \--cacert="/etc/kubernetes/pki/etcd/ca.crt" \member list -w table
Error: dial tcp 127.0.0.1:2379: connect: connection refused

如何排查

这里我们换一个 etcd 节点 执行 命令

查看 etcd 集群成员

┌──[root@vms100.liruilongs.github.io]-[~]
└─$ssh vms101.liruilongs.github.io
Last login: Sat Mar  2 09:52:01 2024 from 192.168.26.100
┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379 \--cert="/etc/kubernetes/pki/etcd/server.crt"  \--key="/etc/kubernetes/pki/etcd/server.key" \--cacert="/etc/kubernetes/pki/etcd/ca.crt" \member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|  ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+

查看节点状态

┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379 \--cert="/etc/kubernetes/pki/etcd/server.crt" \--key="/etc/kubernetes/pki/etcd/server.key"  \--cacert="/etc/kubernetes/pki/etcd/ca.crt" \endpoint status --cluster  -w table
Failed to get the status of endpoint https://192.168.26.100:2379 (context deadline exceeded)
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.101:2379 | 70059e836d19883d |   3.5.4 |   88 MB |     false |       603 |   22208417 |
| https://192.168.26.102:2379 | b8cb9f66c2e63b91 |   3.5.4 |   88 MB |      true |       603 |   22208417 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+

确定 ETCD 节点故障

┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  \--cert="/etc/kubernetes/pki/etcd/server.crt"  \--key="/etc/kubernetes/pki/etcd/server.key"  \--cacert="/etc/kubernetes/pki/etcd/ca.crt" \endpoint  health --cluster  -w table
https://192.168.26.101:2379 is healthy: successfully committed proposal: took = 3.753357ms
https://192.168.26.102:2379 is healthy: successfully committed proposal: took = 2.989943ms
https://192.168.26.100:2379 is unhealthy: failed to connect: dial tcp 192.168.26.100:2379: connect: connection refused
Error: unhealthy cluster

查看 etcd 的容器日志

┌──[root@vms100.liruilongs.github.io]-[~]
└─$docker ps -a | grep etcd
0f2f98ebf8c3   a8a176a5d5d6                                        "etcd --advertise-cl…"   4 minutes ago   Exited (2) 4 minutes ago                  k8s_etcd_etcd-vms100.liruilongs.github.io_kube-system_e8c17bb99f9bd8119cdd769556041e18_1252
a4b39d16a753   registry.aliyuncs.com/google_containers/pause:3.8   "/pause"                 4 hours ago     Up 4 hours                                k8s_POD_etcd-vms100.liruilongs.github.io_kube-system_e8c17bb99f9bd8119cdd769556041e18_54
┌──[root@vms100.liruilongs.github.io]-[~]
└─$docker logs 0f2f98ebf8c3
{"level":"info","ts":"2024-03-16T14:46:54.644Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["etcd","--advertise-client-urls=https://192.168.26.100:2379","--cert-file=/etc/kubernetes/pki/etcd/server.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--experimental-initial-corrupt-check=true","--experimental-watch-progress-notify-interval=5s","--initial-advertise-peer-urls=https://192.168.26.100:2380","--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380","--key-file=/etc/kubernetes/pki/etcd/server.key","--listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379","--listen-metrics-urls=http://127.0.0.1:2381","--listen-peer-urls=https://192.168.26.100:2380","--name=vms100.liruilongs.github.io","--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/etc/kubernetes/pki/etcd/peer.key","--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--snapshot-count=10000","--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"]}
{"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
{"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["https://192.168.26.100:2380"]}
{"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:479","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, client-cert=, client-key=, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
{"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://192.168.26.100:2379"]}
{"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.4","git-sha":"08407ff76","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"vms100.liruilongs.github.io","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.26.100:2380"],"listen-peer-urls":["https://192.168.26.100:2380"],"advertise-client-urls":["https://192.168.26.100:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.26.100:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
panic: freepages: failed to get all reachable pages (page 7744: multiple references)goroutine 109 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2(0xc00009c480)/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1056 +0xe9
created by go.etcd.io/bbolt.(*DB).freepages/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1054 +0x1cd
┌──[root@vms100.liruilongs.github.io]-[~]
└─$

如何解决

这里最快的办法是重新同步一下这个节点的数据,即把这个故障节点移出 集群,清理完故障节点旧数据在重新添加,操作步骤

  • 清理数据目录,移动静态Pod 的yaml 文件:停止故障节点服务,然后删除etcd数据目录。
  • 移除故障节点:使用member remove命令剔除错误节点,可以在健康的节点执行命令。
  • 添加节点:使用member add命令添加故障节点。
  • 重新启动:移动故障节点yaml文件,进行启动

: 静态Pod 通过加载指定目录的 yaml 文件来调度,kubelet 会定时扫描,删除移动 yaml 文件,静态 Pod 会自动停止,同理。添加 yaml 文件会自动创建静态 Pod

移动静态Pod 的yaml 文件

┌──[root@vms100.liruilongs.github.io]-[~]
└─$mv  /etc/kubernetes/manifests/{etcd.yaml,kube-apiserver.yaml}  /tmp/

删除etcd数据目录

┌──[root@vms100.liruilongs.github.io]-[~]
└─$rm -rf /var/lib/etcd/*

确认节点 的 etcdapiservier 都已经停止

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep apiserver
kube-system          kube-apiserver-vms101.liruilongs.github.io            1/1     Running   272 (4h15m ago)   415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          kube-apiserver-vms102.liruilongs.github.io            1/1     Running   246 (4h15m ago)   415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep etcd
kube-system          etcd-vms101.liruilongs.github.io                      1/1     Running   167 (4h15m ago)   415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          etcd-vms102.liruilongs.github.io                      1/1     Running   173 (4h15m ago)   415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$

获取故障节点 ID,下面的操作我们在健康的 etcd 节点执行,或者可以修改 --endpoints

┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://192.168.26.101:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|  ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+

移除故障节点

┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt"  member remove ee392e5273e89e2
Member  ee392e5273e89e2 removed from cluster 4816f346663d82a7

重新添加

┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt"  member add vms100.liruilongs.github.io --peer-urls=https://192.168.26.100:2380
Member 456f71fdc1ad9917 added to cluster 4816f346663d82a7ETCD_NAME="vms100.liruilongs.github.io"
ETCD_INITIAL_CLUSTER="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.26.100:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

回到 100 节点机器,移动 Yaml 文件,恢复节点

┌──[root@vms100.liruilongs.github.io]-[~]
└─$mv /tmp/{etcd.yaml,kube-apiserver.yaml} /etc/kubernetes/manifests/

确认 Pod 状态

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep etcd
kube-system          etcd-vms100.liruilongs.github.io                      1/1     Running   0                 16s    192.168.26.100   vms100.liruilongs.github.io   <none>           <none>
kube-system          etcd-vms101.liruilongs.github.io                      1/1     Running   167 (4h32m ago)   415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          etcd-vms102.liruilongs.github.io                      1/1     Running   173 (4h32m ago)   415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep apiserver
kube-system          kube-apiserver-vms100.liruilongs.github.io            1/1     Running   0                 24s    192.168.26.100   vms100.liruilongs.github.io   <none>           <none>
kube-system          kube-apiserver-vms101.liruilongs.github.io            1/1     Running   272 (4h32m ago)   415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          kube-apiserver-vms102.liruilongs.github.io            1/1     Running   246 (4h32m ago)   415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$

查看 etcd 集群状态

┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+-----------+-----------------------------+-----------------------------+-----------------------------+
|        ID        |  STATUS   |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+-----------+-----------------------------+-----------------------------+-----------------------------+
| 54952f3b494c0286 | unstarted |                             | https://192.168.26.100:2380 |                             |
| 70059e836d19883d |   started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 |   started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+-----------+-----------------------------+-----------------------------+-----------------------------+

这里我们发现 新添加的节点状态不正常,一直是 unstarted

我们在 故障节点执行 etcd 命令。发现故障节点并没有添加到集群,而是作为一个单节点运行。

┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+-----------------+---------+-----------------------------+-----------------------------+-----------------------------+
|       ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+-----------------+---------+-----------------------------+-----------------------------+-----------------------------+
| ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
+-----------------+---------+-----------------------------+-----------------------------+-----------------------------+
┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint status --cluster  -w table
+-----------------------------+-----------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |       ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+-----------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.100:2379 | ee392e5273e89e2 |   3.5.4 |  815 kB |      true |         2 |       2261 |
+-----------------------------+-----------------+---------+---------+-----------+-----------+------------+
┌──[root@vms100.liruilongs.github.io]-[~]
└─$

也没有同步 当前集群的数据

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide  --server=https://vms100.liruilongs.github.io:6443
No resources found

遇到这种情况,大部分原因是 某个节点的 etcd配置文件的问题,我的这个问题是 故障节点的 etcd 配置文件,没有集群信息相关配置,所以这里把集群相关配置写入配置

原本的配置文件

┌──[root@vms100.liruilongs.github.io]-[~]
└─$cat /etc/kubernetes/manifests/etcd.yaml
apiVersion: v1
kind: Pod
metadata:annotations:kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.26.100:2379creationTimestamp: nulllabels:component: etcdtier: control-planename: etcdnamespace: kube-system
spec:containers:- command:- etcd- --advertise-client-urls=https://192.168.26.100:2379- --cert-file=/etc/kubernetes/pki/etcd/server.crt- --client-cert-auth=true- --data-dir=/var/lib/etcd- --experimental-initial-corrupt-check=true- --experimental-watch-progress-notify-interval=5s- --initial-advertise-peer-urls=https://192.168.26.100:2380- --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380- --key-file=/etc/kubernetes/pki/etcd/server.key- --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379- --listen-metrics-urls=http://127.0.0.1:2381- --listen-peer-urls=https://192.168.26.100:2380- --name=vms100.liruilongs.github.io- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt- --peer-client-cert-auth=true- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt- --snapshot-count=10000- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crtimage: registry.aliyuncs.com/google_containers/etcd:3.5.4-0
。。。。。。。。。。。。。。。。

集群信息不全的,添加后的配置文件

┌──[root@vms100.liruilongs.github.io]-[~]
└─$cat /etc/kubernetes/manifests/etcd.yaml
apiVersion: v1
kind: Pod
metadata:annotations:kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.26.100:2379creationTimestamp: nulllabels:component: etcdtier: control-planename: etcdnamespace: kube-system
spec:containers:- command:- etcd- --advertise-client-urls=https://192.168.26.100:2379- --cert-file=/etc/kubernetes/pki/etcd/server.crt- --client-cert-auth=true- --data-dir=/var/lib/etcd- --experimental-initial-corrupt-check=true- --experimental-watch-progress-notify-interval=5s- --initial-advertise-peer-urls=https://192.168.26.100:2380- --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380- --initial-cluster-state=existing- --key-file=/etc/kubernetes/pki/etcd/server.key- --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379- --listen-metrics-urls=http://127.0.0.1:2381- --listen-peer-urls=https://192.168.26.100:2380- --name=vms100.liruilongs.github.io- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt- --peer-client-cert-auth=true- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt- --snapshot-count=10000- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

然后我们以上面相同的方式从新恢复一次,发现节点直接没有起来

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep apiserver
kube-system          kube-apiserver-vms100.liruilongs.github.io            0/1     CrashLoopBackOff   1 (18s ago)       39s    192.168.26.100   vms100.liruilongs.github.io   <none>           <none>
kube-system          kube-apiserver-vms101.liruilongs.github.io            1/1     Running            272 (5h29m ago)   415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          kube-apiserver-vms102.liruilongs.github.io            1/1     Running            246 (5h29m ago)   415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep etcd
kube-system          etcd-vms100.liruilongs.github.io                      0/1     CrashLoopBackOff   3 (21s ago)       53s    192.168.26.100   vms100.liruilongs.github.io   <none>           <none>
kube-system          etcd-vms101.liruilongs.github.io                      1/1     Running            167 (5h29m ago)   415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          etcd-vms102.liruilongs.github.io                      1/1     Running            173 (5h29m ago)   415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>

查看日志

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl logs etcd-vms100.liruilongs.github.io -n kube-system
.............................
{"level":"fatal","ts":"2024-03-16T16:25:19.981Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"error validating peerURLs {ClusterID:4816f346663d82a7 Members:[&{ID:b8cb9f66c2e63b91 RaftAttributes:{PeerURLs:[https://192.168.26.102:2380] IsLearner:false} Attributes:{Name:vms102.liruilongs.github.io ClientURLs:[https://192.168.26.102:2379]}} &{ID:3fbbbed942c51f7b RaftAttributes:{PeerURLs:[https://192.168.26.100:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}} &{ID:70059e836d19883d RaftAttributes:{PeerURLs:[https://192.168.26.101:2380] IsLearner:false} Attributes:{Name:vms101.liruilongs.github.io ClientURLs:[https://192.168.26.101:2379]}}] RemovedMemberIDs:[]}: member count is unequal","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}

根据日志信息,可以看到有用的信息 RemovedMemberIDs:[]}: member count is unequal ,成员数量不相等,在分析日志

{"level": "info","ts": "2024-03-16T16:25:19.961Z","caller": "etcdmain/etcd.go:73","msg": "Running: ","args": ["etcd","--advertise-client-urls=https://192.168.26.100:2379","--cert-file=/etc/kubernetes/pki/etcd/server.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--experimental-initial-corrupt-check=true","--experimental-watch-progress-notify-interval=5s","--initial-advertise-peer-urls=https://192.168.26.100:2380","--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380","--initial-cluster-state=existing","--key-file=/etc/kubernetes/pki/etcd/server.key","--listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379","--listen-metrics-urls=http://127.0.0.1:2381","--listen-peer-urls=https://192.168.26.100:2380","--name=vms100.liruilongs.github.io","--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/etc/kubernetes/pki/etcd/peer.key","--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--snapshot-count=10000","--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"]
}
..............................................................................
{"level": "warn","ts": "2024-03-16T16:25:19.981Z","caller": "etcdmain/etcd.go:146","msg": "failed to start etcd","error": "error validating peerURLs {ClusterID:4816f346663d82a7 Members:[&{ID:b8cb9f66c2e63b91 RaftAttributes:{PeerURLs:[https://192.168.26.102:2380] IsLearner:false} Attributes:{Name:vms102.liruilongs.github.io ClientURLs:[https://192.168.26.102:2379]}} &{ID:3fbbbed942c51f7b RaftAttributes:{PeerURLs:[https://192.168.26.100:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}} &{ID:70059e836d19883d RaftAttributes:{PeerURLs:[https://192.168.26.101:2380] IsLearner:false} Attributes:{Name:vms101.liruilongs.github.io ClientURLs:[https://192.168.26.101:2379]}}] RemovedMemberIDs:[]}: member count is unequal"
}
{"level": "fatal","ts": "2024-03-16T16:25:19.981Z","caller": "etcdmain/etcd.go:204","msg": "discovery failed","error": "error validating peerURLs {ClusterID:4816f346663d82a7 Members:[&{ID:b8cb9f66c2e63b91 RaftAttributes:{PeerURLs:[https://192.168.26.102:2380] IsLearner:false} Attributes:{Name:vms102.liruilongs.github.io ClientURLs:[https://192.168.26.102:2379]}} &{ID:3fbbbed942c51f7b RaftAttributes:{PeerURLs:[https://192.168.26.100:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}} &{ID:70059e836d19883d RaftAttributes:{PeerURLs:[https://192.168.26.101:2380] IsLearner:false} Attributes:{Name:vms101.liruilongs.github.io ClientURLs:[https://192.168.26.101:2379]}}] RemovedMemberIDs:[]}: member count is unequal","stacktrace": "go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"
}

可以看到它提示 可能错误与 vms102.liruilongs.github.io 节点相关

然后我们看一下 vms102.liruilongs.github.io 的配置文件

┌──[root@vms102.liruilongs.github.io]-[~]
└─$cat /etc/kubernetes/manifests/etcd.yaml
apiVersion: v1
kind: Pod
metadata:annotations:kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.26.102:2379creationTimestamp: nulllabels:component: etcdtier: control-planename: etcdnamespace: kube-system
spec:containers:- command:- etcd- --advertise-client-urls=https://192.168.26.102:2379- --cert-file=/etc/kubernetes/pki/etcd/server.crt- --client-cert-auth=true- --data-dir=/var/lib/etcd- --experimental-initial-corrupt-check=true- --experimental-watch-progress-notify-interval=5s- --initial-advertise-peer-urls=https://192.168.26.102:2380- --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380- --initial-cluster-state=existing- --key-file=/etc/kubernetes/pki/etcd/server.key- --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.102:2379- --listen-metrics-urls=http://127.0.0.1:2381- --listen-peer-urls=https://192.168.26.102:2380- --name=vms102.liruilongs.github.io- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt- --peer-client-cert-auth=true- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt- --snapshot-count=10000- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

通过配置文件比对,可以发现,之前配置的故障节点的配置任然有问题,少了一个vms102.liruilongs.github.io=https://192.168.26.102:2380节点信息。

"--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380",
"--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380"

修改完配置,按照上面相同的流程重新恢复节点, 节点恢复

通过 etcdctl 命令检查

┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| ac5f6045dbe477b3 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint status --cluster  -w table
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.101:2379 | 70059e836d19883d |   3.5.4 |   88 MB |     false |       603 |   22227327 |
| https://192.168.26.100:2379 | ac5f6045dbe477b3 |   3.5.4 |   88 MB |     false |       603 |   22227327 |
| https://192.168.26.102:2379 | b8cb9f66c2e63b91 |   3.5.4 |   88 MB |      true |       603 |   22227327 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
┌──[root@vms100.liruilongs.github.io]-[~]
└─$

故障节点恢复,在实际的操作中,添加完节点,我们需要确认故障节点的配置文件是否是正确的配置文件


© 2018-2024 liruilonger@gmail.com, All rights reserved. 保持署名-非商用-相同方式共享(CC BY-NC-SA 4.0)

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/752182.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

【JVM】(内存区域划分 为什么要划分 具体如何分 类加载机制 类加载基本流程 双亲委派模型 类加载器 垃圾回收机制(GC))

文章目录 内存区域划分为什么要划分具体如何分 类加载机制类加载基本流程双亲委派模型类加载器 垃圾回收机制&#xff08;GC&#xff09; 内存区域划分 为什么要划分 JVM启动的时候会申请到一整个很大的内存区域,JVM是一个应用程序,要从操作系统这里申请内存,JVM就需要根据,把…

【Leetcode-73.矩阵置零】

题目&#xff1a; 给定一个 m x n 的矩阵&#xff0c;如果一个元素为 0 &#xff0c;则将其所在行和列的所有元素都设为 0 。请使用 原地 算法。 示例 1&#xff1a; 输入&#xff1a;matrix [[1,1,1],[1,0,1],[1,1,1]] 输出&#xff1a;[[1,0,1],[0,0,0],[1,0,1]]示例 2&…

如何写好Stable Diffusion的prompt

Stable Diffusion是一种强大的文本到图像生成模型&#xff0c;其效果在很大程度上取决于输入的提示词&#xff08;Prompt&#xff09;。以下是一些关于如何编写有效的Stable Diffusion Prompt的秘诀&#xff1a; 明确描述&#xff1a;尽量清晰地描述你想要的图像内容。使用具体…

Mysql8和Mysql5加锁规则的细微不同

前言 验证版本 分别为Mysql5.7.22和Mysql8.0.26 关于Mysql5.7.22的加锁规则请查看 MySQL行锁加锁规则之等值查询 和 MySQL行锁加锁规则之范围查询 两个版本的等值查询规则一样&#xff0c;区别在于范围查询的“向右匹配到第一个不符合记录” 的加锁规则&#xff0c;Mysql5.…

2024全新返佣商城分销商城理财商城系统源码 全开源PHP+VUE源码

2023全新返佣商城分销商城理财商城系统源码 全开源PHPVUE源码 程序安装环境要求&#xff1a; nginx1.16 php7.2 mysql5.6 程序全开源PHPVUE源码 有需要测试的老铁&#xff0c;拿去测试吧

Linux_基础指令(一)

目录 1、ls指令 1.1 ls -l 1.2 ls -a 1.3 ls -i 2、pwd指令 3、cd指令 3.1 路径的概念 3.1.1 绝对路径 3.1.2 相对路径 3.2 cd ~ 3.3 cd - 4、touch指令 5、mkdir指令 6、删除系列的指指令 6.1 rmdir 6.2 rm 7、man指令 8、cp指令 9、move指令 结…

【Java】十大排序

目录 冒泡排序 选择排序 插入排序 希尔排序 归并排序 快速排序 堆排序 计数排序 桶排序 基数排序 冒泡排序 冒泡排序(Bubble Sort)是一种简单的排序算法。它重复地遍历要排序的序列&#xff0c;依次比较两个元素&#xff0c;如果它们的顺序错误就把它们交换过来。遍历…

论文阅读_参数微调_P-tuning_v2

1 P-Tuning PLAINTEXT 1 2 3 4 5 6 7英文名称: GPT Understands, Too 中文名称: GPT也懂 链接: https://arxiv.org/abs/2103.10385 作者: Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, Jie Tang 机构: 清华大学, 麻省理工学院 日期: 2021-03-18…

cache的58问,您能回答上几个

快速链接: . &#x1f449;&#x1f449;&#x1f449; 个人博客笔记导读目录(全部) &#x1f448;&#x1f448;&#x1f448; 付费专栏-付费课程 【购买须知】: 【精选】ARMv8/ARMv9架构入门到精通-[目录] &#x1f448;&#x1f448;&#x1f448;联系方式-加入交流群 ---…

东京旅行攻略:机票、交通、消费和看富士山

欢迎关注「苏南下」 在这里分享我的旅行和影像创作心得 分享一些最近从香港去日本东京的短期周末旅行体验。 1&#xff1a;香港飞东京需要4小时&#xff0c;非旺季往返的价格在2000元左右。去年五一假期&#xff0c;我提前了两个多月买的香港快运就是这个价&#xff0c;不算贵。…

微信小程序的页面制作---常用组件及其属性

微信小程序里的组件就是html里的标签&#xff0c;但其组件都自带UI风格和特定的功能效果 一、常用组件 view&#xff08;视图容器&#xff09;、text&#xff08;文本&#xff09;、button&#xff08;按钮&#xff09;、image&#xff08;图片&#xff09;、form&#xff08…

Sentinel篇:线程隔离和熔断降级

书接上回&#xff1a;微服务&#xff1a;Sentinel篇 3. 隔离和降级 限流是一种预防措施&#xff0c;虽然限流可以尽量避免因高并发而引起的服务故障&#xff0c;但服务还会因为其它原因而故障。 而要将这些故障控制在一定范围&#xff0c;避免雪崩&#xff0c;就要靠线程隔离…

【Visual Studio】VS转换文件为UTF8格式

使用高级保存选项 更改VS的编码方案 首先需要打开高级保存选项 然后打开 文件 —> 高级保存选项 即可进行设置

OSPF协议全面学习笔记

作者&#xff1a;BSXY_19计科_陈永跃 BSXY_信息学院 注&#xff1a;未经允许禁止转发任何内容 OSPF协议全面学习笔记 1、OSPF基础2、DR与BDR3、OSPF多区域4、虚链路Vlink5、OSPF报文6、LSA结构1、一类/二类LSA&#xff08;Router-LSA/Network-LSA&#xff09; 更新完善中... 1、…

ChatGPT :确定性AI源自于确定性数据

ChatGPT 幻觉 大模型实际应用落地过程中&#xff0c;会遇到幻觉&#xff08;Hallucination&#xff09;问题。对于语言模型而言&#xff0c;当生成的文本语法正确流畅&#xff0c;但不遵循原文&#xff08;Faithfulness&#xff09;&#xff0c;或不符合事实&#xff08;Factua…

解决被反射的类无法被spring管理,空指针异常问题

目录 问题 原因 解决方案 问题 这是我在开发语音指令识别功能时遇到的问题&#xff0c;在将语音转成文字后需要进行指令匹配&#xff0c;指令匹配成功&#xff0c;就会获得对应的处理方法名称以及所属类&#xff0c;然后通过反射机制去执行该方法完成指令对应的操作 部分代…

C++——string

一学习string的原因 1.从个人理解角度上&#xff1a; 在刚开始学习之前&#xff0c;我只知道学习完string在以后的刷题中能提高做题效率&#xff0c;在对字符串的处理string库中也许有对应的接口去实现需求&#xff0c;不用自己去写函数的实现。 但在学string中改变了之前的…

谈谈IC、ASIC、SoC、MPU、MCU、CPU、GPU、DSP、FPGA、CPLD的简介

谈谈IC、ASIC、SoC、MPU、MCU、CPU、GPU、DSP、FPGA、CPLD的简介 IC (Integrated Circuit) 集成电路 (Integrated Circuit, IC) 是一种把电路中的元器件如电阻、电容、晶体管等集成在一块半导体材料上的微型电子器件。它是现代电子系统的基础组件&#xff0c;按照功能可分为模…

【PyTorch】基础学习:一文详细介绍 torch.save() 的用法和应用

【PyTorch】基础学习&#xff1a;一文详细介绍 torch.save() 的用法和应用 &#x1f308; 个人主页&#xff1a;高斯小哥 &#x1f525; 高质量专栏&#xff1a;Matplotlib之旅&#xff1a;零基础精通数据可视化、Python基础【高质量合集】、PyTorch零基础入门教程&#x1f44…

Flask中的Blueprints:模块化和组织大型Web应用【第142篇—Web应用】

&#x1f47d;发现宝藏 前些天发现了一个巨牛的人工智能学习网站&#xff0c;通俗易懂&#xff0c;风趣幽默&#xff0c;忍不住分享一下给大家。【点击进入巨牛的人工智能学习网站】。 Flask中的Blueprints&#xff1a;模块化和组织大型Web应用 在构建大型Web应用时&#xff0…