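1. Problem Description
The customer reported that nodes in the UAT Kubernetes cluster were in the NotReady state and that new resources could not be created, which affected the vendor's deployment testing. To avoid further impact on the vendor's application release testing, the on-site engineer restored the cluster by restarting the kubelet and Docker container services. This report traces and analyzes the fault over the time window in which it occurred on the UAT Kubernetes cluster. The fault details are shown in the figure below.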
2. Problem Analysis
2.1 Analysis Process
1. Review the details in /var/log/messages:
Mar 10 10:36:18 UAT-K8S-MASTER01 systemd: Started Session 26305 of user root.
Mar 10 10:36:18 UAT-K8S-MASTER01 systemd-logind: New session 26305 of user root. << root user logs in
Mar 10 10:36:19 UAT-K8S-MASTER01 kubelet: E0310 10:36:19.780889 20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/keepalived"
Mar 10 10:36:19 UAT-K8S-MASTER01 kubelet: E0310 10:36:19.783096 20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/nginx-lb"
Mar 10 10:36:39 UAT-K8S-MASTER01 kubelet: E0310 10:36:39.780897 20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/keepalived"
Mar 10 10:36:39 UAT-K8S-MASTER01 kubelet: E0310 10:36:39.783123 20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/nginx-lb"
Mar 10 10:36:59 UAT-K8S-MASTER01 kubelet: E0310 10:36:59.780902 20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/keepalived"
Mar 10 10:36:59 UAT-K8S-MASTER01 kubelet: E0310 10:36:59.783046 20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/nginx-lb"
Mar 10 10:37:03 UAT-K8S-MASTER01 systemd: Stopping firewalld - dynamic firewall daemon... << firewalld stop operation
Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: [sh] scheduler unregistered.
Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: [wrr] scheduler unregistered.
Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: [rr] scheduler unregistered.
Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: ipvs unloaded. << the IPVS module is unloaded as well
Mar 10 10:37:04 UAT-K8S-MASTER01 systemd: Stopped firewalld - dynamic firewall daemon.
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.573494 20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused << connection refused errors begin here and repeat continuously
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.573696 20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.573881 20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.574009 20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.574123 20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: I0310 10:37:04.574138 20272 controller.go:106] failed to update lease using latest lease, fallback to ensure lease, err: failed 5 attempts to update node lease
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.574228 20272 controller.go:136] failed to ensure node lease exists, will retry in 200ms, error: Get " eout=10s": dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.774499 20272 controller.go:136] failed to ensure node lease exists, will retry in 400ms, error: Get " eout=10s": dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.174927 20272 controller.go:136] failed to ensure node lease exists, will retry in 800ms, error: Get " eout=10s": dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.314735 20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get " s": dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.315321 20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.315477 20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.315616 20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get " dial tcp 172.31.250.21:16443: connect: connection refused
2. Check the firewall status:
[root@UAT-K8S-MASTER01 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead) since Wed 2021-03-10 10:37:04 CST; 1 day 4h ago << the stop time matches the timestamp recorded in the messages log
     Docs: man:firewalld(1)
 Main PID: 1201 (code=exited, status=0/SUCCESS)
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -F DOCKER-ISOLATION-STAGE-1' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -X DOCKER-ISOLATION-STAGE-1' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -F DOCKER-ISOLATION-STAGE-2' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -X DOCKER-ISOLATION-STAGE-2' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -F DOCKER-ISOLATION' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -X DOCKER-ISOLATION' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -D FORWARD -i docker0 -o docker0 -j DROP' failed: iptables: Bad rule (does a matching rule exist in that chain?).
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -D FORWARD -i docker0 -o docker0 -j DROP' failed: iptables: Bad rule (does a matching rule exist in that chain?).
Mar 10 10:37:03 UAT-K8S-MASTER01 systemd[1]: Stopping firewalld - dynamic firewall daemon...
Mar 10 10:37:04 UAT-K8S-MASTER01 systemd[1]: Stopped firewalld - dynamic firewall daemon. << firewalld stopped at 10:37:04 on March 10
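The cross-check between the two steps above (the stop time reported by systemctl versus the one in /var/log/messages) can be sketched as a small script. This is only an illustration: the function names are hypothetical, the input strings are excerpts copied from this report, and the year is taken from the systemctl output because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Excerpts copied from this report; variable names are illustrative only.
STATUS_OUTPUT = "   Active: inactive (dead) since Wed 2021-03-10 10:37:04 CST; 1 day 4h ago"
MESSAGES_LINES = [
    "Mar 10 10:37:03 UAT-K8S-MASTER01 systemd: Stopping firewalld - dynamic firewall daemon...",
    "Mar 10 10:37:04 UAT-K8S-MASTER01 systemd: Stopped firewalld - dynamic firewall daemon.",
]


def status_inactive_since(status_text):
    """Return the 'inactive (dead) since' time from systemctl status output, or None."""
    match = re.search(
        r"Active: inactive \(dead\) since \w{3} (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})",
        status_text,
    )
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")


def syslog_stopped_at(lines, year):
    """Return the time of the first 'Stopped firewalld' syslog line, or None.

    The first 15 characters of a syslog line hold 'Mon DD HH:MM:SS' without a
    year, so the caller has to supply the year.
    """
    for line in lines:
        if "Stopped firewalld" in line:
            return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    return None


from_status = status_inactive_since(STATUS_OUTPUT)
from_messages = syslog_stopped_at(MESSAGES_LINES, from_status.year)
print("systemctl status:", from_status)
print("messages log:    ", from_messages)
print("consistent:      ", from_status == from_messages)
```

Running the same comparison on every node would show whether firewalld was stopped everywhere at about the same time.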
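2.2 Root Cause
The /var/log/messages log shows that a user logged in as root at 10:36:18 that morning and then stopped the firewalld service on UAT-K8S-MASTER01; the stop began at 10:37:03 and completed at 10:37:04. (Further checks confirmed that the firewall had been stopped on the other nodes as well.) As the firewall was being stopped, the log recorded the IPVS schedulers (sh, wrr, rr) being unregistered and the IPVS module being unloaded. Immediately afterwards, kubelet began logging "connection refused" errors for 172.31.250.21:16443, and these errors continued without interruption. As a result, no new resources could be created on any node in the cluster, and all nodes were reported as NotReady.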
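The sequence of events described above can be reconstructed mechanically from the log. The following is a minimal sketch, assuming the log excerpt from this report as input; the marker strings and the `build_timeline` helper are illustrative choices rather than an existing tool, and the request URL in the kubelet line is missing in the source log and is left as it appears there.

```python
from datetime import datetime

# Log lines copied from the /var/log/messages excerpt in this report.
LOG = """\
Mar 10 10:36:18 UAT-K8S-MASTER01 systemd-logind: New session 26305 of user root.
Mar 10 10:37:03 UAT-K8S-MASTER01 systemd: Stopping firewalld - dynamic firewall daemon...
Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: ipvs unloaded.
Mar 10 10:37:04 UAT-K8S-MASTER01 systemd: Stopped firewalld - dynamic firewall daemon.
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.573494 20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused
"""

# (label, substring) pairs; the substrings are assumptions about which text
# identifies each event in this particular log.
MARKERS = [
    ("root login", "New session"),
    ("firewalld stop requested", "Stopping firewalld"),
    ("IPVS unloaded", "IPVS: ipvs unloaded"),
    ("firewalld stopped", "Stopped firewalld"),
    ("first connection refused", "connection refused"),
]


def build_timeline(log_text, year):
    """Return [(label, datetime)] with the first occurrence of each marker."""
    found = {}
    for line in log_text.splitlines():
        for label, needle in MARKERS:
            if label not in found and needle in line:
                # Syslog timestamps occupy the first 15 characters and omit the year.
                found[label] = datetime.strptime(
                    f"{year} {line[:15]}", "%Y %b %d %H:%M:%S"
                )
    # sorted() is stable, so events in the same second keep their log order.
    return sorted(found.items(), key=lambda item: item[1])


timeline = build_timeline(LOG, 2021)
start = timeline[0][1]
for label, when in timeline:
    offset = int((when - start).total_seconds())
    print(f"{when:%H:%M:%S}  +{offset:>3}s  {label}")
```

On this excerpt the first refused connection appears one second after the firewalld stop was requested, which is the ordering the root-cause argument relies on.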
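3. Summary and Recommendations
3.1 Summary
The firewall service on the Kubernetes cluster hosts was stopped by human error. As a result, the components on the cluster nodes could no longer reach each other through the firewall policies, which disrupted the operation of the whole cluster and can even take all cluster nodes down. After the firewall service was stopped, restarting kubelet on the nodes restored the Kubernetes cluster, but node-to-node communication is now no longer governed by the firewall policies. Although this keeps the nodes able to communicate, it affects DNS resolution between services inside Kubernetes. We recommend restoring the firewall policies so that the environment keeps its original configuration.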
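3.2 Recommendations
1. Monitoring of the UAT environment is currently incomplete. The key metrics of every node and component should be added to the monitoring platform, with corresponding alerts and notifications (email, SMS, etc.) configured, so that the overall performance and health of the cluster are known as soon as something changes.
2. Tighten control over the root account and over who can log in. For example, check whether staff from other departments are able to log in as root, and restrict root login privileges to prevent similar incidents.
3. The firewall is currently still stopped on all nodes. Because communication between the components of this k8s cluster is controlled through the firewall network policies, leaving the firewall stopped affects DNS resolution between services inside Kubernetes. We recommend re-enabling the firewall service.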
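As one possible starting point for the monitoring recommendation, a basic reachability probe for the API endpoint seen in the logs (172.31.250.21:16443) could look like the sketch below. It is an assumption-laden illustration, not part of the existing environment: `tcp_reachable` is a hypothetical helper, and the actual alert delivery (email, SMS) is left to whatever monitoring platform is used.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_endpoint(host, port):
    """Return a one-line status message suitable for a log or an alert hook."""
    state = "OK" if tcp_reachable(host, port) else "UNREACHABLE"
    return f"{host}:{port} {state}"


# Example call (endpoint taken from the kubelet errors in this report):
#     print(check_endpoint("172.31.250.21", 16443))
```

Run periodically from each node, for instance from cron or a monitoring agent, a probe like this would have flagged the "connection refused" condition within one polling interval.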