Kubernetes Cluster Node NotReady Failure Analysis Report


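1. Problem Description

The customer reported that nodes in the Kubernetes cluster of the UAT test environment were in the NotReady state and that new resources could not be created, which affected the vendors' deployment testing. To avoid disrupting the vendors' application release testing, the on-site engineer restored the Kubernetes cluster by restarting the kubelet and Docker container services. This report traces and analyzes the time window of the failure on the UAT Kubernetes cluster. The specific fault information is shown in the figure below.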

2. Problem Analysis
2.1 Analysis Process

1. Review the details in /var/log/messages:

Mar 10 10:36:18 UAT-K8S-MASTER01 systemd: Started Session 26305 of user root.
Mar 10 10:36:18 UAT-K8S-MASTER01 systemd-logind: New session 26305 of user root.      << root user logs in
Mar 10 10:36:19 UAT-K8S-MASTER01 kubelet: E0310 10:36:19.780889   20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/keepalived"
Mar 10 10:36:19 UAT-K8S-MASTER01 kubelet: E0310 10:36:19.783096   20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/nginx-lb"
Mar 10 10:36:39 UAT-K8S-MASTER01 kubelet: E0310 10:36:39.780897   20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/keepalived"
Mar 10 10:36:39 UAT-K8S-MASTER01 kubelet: E0310 10:36:39.783123   20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/nginx-lb"
Mar 10 10:36:59 UAT-K8S-MASTER01 kubelet: E0310 10:36:59.780902   20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/keepalived"
Mar 10 10:36:59 UAT-K8S-MASTER01 kubelet: E0310 10:36:59.783046   20272 file.go:182] Not recursing into manifest path "/etc/kubernetes/manifests/nginx-lb"
Mar 10 10:37:03 UAT-K8S-MASTER01 systemd: Stopping firewalld - dynamic firewall daemon...     << firewall shutdown operation
Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: [sh] scheduler unregistered.
Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: [wrr] scheduler unregistered.
Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: [rr] scheduler unregistered.
Mar 10 10:37:03 UAT-K8S-MASTER01 kernel: IPVS: ipvs unloaded.     << the IPVS module unload is also logged
Mar 10 10:37:04 UAT-K8S-MASTER01 systemd: Stopped firewalld - dynamic firewall daemon.

Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.573494   20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused     << connection refused errors begin here and repeat continuously
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.573696   20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.573881   20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.574009   20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.574123   20272 controller.go:178] failed to update node lease, error: Put " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: I0310 10:37:04.574138   20272 controller.go:106] failed to update lease using latest lease, fallback to ensure lease, err: failed 5 attempts to update node lease
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.574228   20272 controller.go:136] failed to ensure node lease exists, will retry in 200ms, error: Get " eout=10s": dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:04 UAT-K8S-MASTER01 kubelet: E0310 10:37:04.774499   20272 controller.go:136] failed to ensure node lease exists, will retry in 400ms, error: Get " eout=10s": dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.174927   20272 controller.go:136] failed to ensure node lease exists, will retry in 800ms, error: Get " eout=10s": dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.314735   20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get " s": dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.315321   20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.315477   20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get " dial tcp 172.31.250.21:16443: connect: connection refused
Mar 10 10:37:05 UAT-K8S-MASTER01 kubelet: E0310 10:37:05.315616   20272 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "uat-k8s-master01": Get " dial tcp 172.31.250.21:16443: connect: connection refused
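The order of events in the log above (firewalld stopping, IPVS unloading, first connection refused) can be pulled out of a large messages file with a few pattern matches. The following is a minimal sketch, assuming the syslog layout shown above (month, day, time, host, message); the function names `first_match_time` and `incident_timeline` are illustrative and were not part of the original investigation.

```shell
#!/usr/bin/env bash
# Print the timestamp (HH:MM:SS) of the first line in a syslog-style file
# that matches the given extended regular expression.
# Usage: first_match_time PATTERN [LOGFILE]
first_match_time() {
    local pattern="$1"
    local logfile="${2:-/var/log/messages}"
    # In "Mon DD HH:MM:SS host ..." lines the time is the third field.
    grep -E -m 1 -- "$pattern" "$logfile" | awk '{print $3}'
}

# Summarize the three events that matter for this incident.
# Usage: incident_timeline [LOGFILE]
incident_timeline() {
    local logfile="${1:-/var/log/messages}"
    echo "firewalld stopping: $(first_match_time 'Stopping firewalld' "$logfile")"
    echo "ipvs unloaded:      $(first_match_time 'IPVS: ipvs unloaded' "$logfile")"
    echo "first conn refused: $(first_match_time 'connect: connection refused' "$logfile")"
}
```

Running `incident_timeline /var/log/messages` on each node would give a quick way to compare when the same sequence occurred across the cluster.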

2. Check the firewall status details:

[root@UAT-K8S-MASTER01 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead) since Wed 2021-03-10 10:37:04 CST; 1 day 4h ago    << the firewall stop time matches the timestamps in the messages log
     Docs: man:firewalld(1)
 Main PID: 1201 (code=exited, status=0/SUCCESS)

Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -F DOCKER-ISOLATION-STAGE-1' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -X DOCKER-ISOLATION-STAGE-1' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -F DOCKER-ISOLATION-STAGE-2' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -X DOCKER-ISOLATION-STAGE-2' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -F DOCKER-ISOLATION' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -t filter -X DOCKER-ISOLATION' failed: iptables: No chain/target/match by that name.
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -D FORWARD -i docker0 -o docker0 -j DROP' failed: iptables: Bad rule (does a matching rule exist in that chain?).
Jan 12 09:37:35 UAT-K8S-MASTER01 firewalld[1201]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -D FORWARD -i docker0 -o docker0 -j DROP' failed: iptables: Bad rule (does a matching rule exist in that chain?).
Mar 10 10:37:03 UAT-K8S-MASTER01 systemd[1]: Stopping firewalld - dynamic firewall daemon...
Mar 10 10:37:04 UAT-K8S-MASTER01 systemd[1]: Stopped firewalld - dynamic firewall daemon.      << firewall stopped at 10:37:04 on March 10
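Since the messages log records the IPVS module being unloaded at the same moment firewalld stopped, it may be worth checking both on each node. The following is a minimal sketch, assuming a Linux host where /proc/modules is readable and IPVS is built as a loadable module named `ip_vs`; the helper names `module_loaded` and `report_node_state` are hypothetical.

```shell
#!/usr/bin/env bash
# Return success if the named kernel module appears in a /proc/modules-style
# listing, where the first whitespace-separated field is the module name.
# Usage: module_loaded MODULE [MODULES_FILE]
module_loaded() {
    local module="$1"
    local modules_file="${2:-/proc/modules}"
    awk -v m="$module" '$1 == m { found = 1 } END { exit(found ? 0 : 1) }' "$modules_file"
}

# Report the state of the two pieces involved in this incident on the local node.
report_node_state() {
    if module_loaded ip_vs; then
        echo "ip_vs: loaded"
    else
        echo "ip_vs: not loaded"
    fi
    # systemctl is-active prints the unit state (for example active or inactive).
    echo "firewalld: $(systemctl is-active firewalld 2>/dev/null)"
}
```

Calling `report_node_state` on every node gives a one-line-per-item view that can be compared against the timestamps above.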


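2.2 Root Cause

The /var/log/messages log shows that a user logged in as root at 10:36:18 that morning and then issued a command to stop the firewall: firewalld on UAT-K8S-MASTER01 began stopping at 10:37:03 and was fully stopped at 10:37:04 (further checks confirmed that the firewall had been stopped on the other nodes as well). As the firewall stopped, the messages log recorded the IPVS schedulers (sh, wrr, rr) being unregistered and the IPVS module being unloaded. Connection refused errors for 172.31.250.21:16443 then began at 10:37:04 and were logged continuously. As a result, no node in the cluster could create new resources, and all nodes were reported as NotReady.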
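The connection refused symptom can be reproduced or ruled out from any node with a plain TCP probe of the endpoint that kubelet was dialing. The following is a minimal sketch, assuming bash (for its /dev/tcp redirection) and the coreutils `timeout` command are available; `tcp_port_open` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Return success if a TCP connection to HOST:PORT can be opened within
# TIMEOUT seconds. Relies on bash's /dev/tcp redirection and coreutils timeout.
# Usage: tcp_port_open HOST PORT [TIMEOUT_SECONDS]
tcp_port_open() {
    local host="$1" port="$2" limit="${3:-3}"
    # Inside the child shell, $0 is the host and $1 is the port.
    timeout "$limit" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, `tcp_port_open 172.31.250.21 16443 && echo reachable || echo unreachable` tests the address and port that appear in the kubelet errors.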

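3. Summary and Recommendations
3.1 Summary

The firewall service on the Kubernetes cluster hosts was stopped by human error. As a result, the components on the cluster nodes could no longer communicate through the firewall policies, which disrupted the cluster as a whole and can take every node down. After the firewall was stopped, restarting kubelet on the nodes restored the Kubernetes cluster, but traffic between nodes is now no longer governed by firewall policies. This state keeps the nodes able to communicate, but it affects DNS resolution between services inside Kubernetes. We recommend restoring the firewall policies so that the environment matches its original configuration.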

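3.2 Recommendations
1. Monitoring of the UAT environment is currently incomplete. Key metrics for every node and component should be brought into the monitoring platform, with alerts and notifications (email, SMS, etc.) configured, so that the overall performance and health of the cluster are known immediately.
2. Control the root account and user logins. For example, check whether staff from other departments are able to log in as root. Root login permissions must be tightly controlled to prevent a recurrence.
3. The firewall is still stopped on all nodes. Because communication between the components of the k8s cluster is currently governed by firewall network policies, leaving the firewall stopped affects DNS resolution between services inside Kubernetes. We recommend re-enabling the firewall service.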
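As one small piece of the monitoring recommended above, a periodic check can flag nodes that are not Ready. The following is a minimal sketch, assuming `kubectl` is installed and configured on the host that runs it; the function names `not_ready_nodes` and `check_nodes` are hypothetical, and how the result is forwarded (email, SMS) is left to the monitoring platform.

```shell
#!/usr/bin/env bash
# Read `kubectl get nodes --no-headers` output on stdin and print the names
# of nodes whose STATUS column (second field) does not start with "Ready".
not_ready_nodes() {
    awk 'NF >= 2 && $2 !~ /^Ready/ { print $1 }'
}

# Cron-style check: list the affected nodes and return non-zero if any
# node is not Ready, so that a scheduler or alerting wrapper can react.
check_nodes() {
    local bad
    bad="$(kubectl get nodes --no-headers | not_ready_nodes)"
    if [ -n "$bad" ]; then
        echo "NotReady nodes: $(echo "$bad" | tr '\n' ' ')"
        return 1
    fi
    echo "all nodes Ready"
}
```

A status such as `Ready,SchedulingDisabled` is treated as Ready here; whether cordoned nodes should also raise an alert is a policy choice.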

