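After I installed a Kubernetes 1.18.0 cluster, a power outage and reboot left the cluster unable to resolve internal addresses such as namespace-scoped Service names. It was a strange failure, and this post simply records how I tracked it down.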
Discovering the problem
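I had installed Prometheus for monitoring, with Grafana for the charts, and the backend kept returning 502. Puzzled, I opened the data source management page: the Prometheus data never loaded, and clicking the Test button for http://prometheus-k8s.monitoring.svc:9090 always returned a 500 error. Only then did it occur to me that in-cluster address resolution was the problem.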
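Both Grafana and Prometheus are exposed through ingress in my setup, so accessing each one on its own worked fine. That is why I never suspected the data inside Grafana. Only after looking at the monitoring dashboards did I find that the data source had been failing the whole time. Facepalm.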
Approach
- Check whether Service name resolution works inside the cluster
- Pin down the specific problem
$ curl -I 10.96.0.10:53
curl: (7) Failed connect to 10.96.0.10:53; No route to host
# On the machine hosting CoreDNS
$ cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 114.114.114.114
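A DNS port does not speak HTTP, so the curl probe above can only show whether the address is reachable at all. As a rough sketch, a direct DNS query against the cluster DNS Service IP gives a clearer answer. The helper name `check_svc_dns` and the `DIG` override variable are my own illustrative choices, and the sketch assumes `dig` is installed on the node.

```shell
# Resolver command; override DIG to use another dig-compatible binary (assumption).
DIG=${DIG:-dig}

check_svc_dns() {
  # $1 = DNS server IP (here the cluster DNS Service IP), $2 = name to resolve.
  # Prints the answer and succeeds only if the server returned at least one record.
  answer=$($DIG +short +time=2 +tries=1 "@$1" "$2" A) || return 1
  [ -n "$answer" ] || return 1
  printf '%s\n' "$answer"
}
```

For example: `check_svc_dns 10.96.0.10 kubernetes.default.svc.cluster.local || echo "cluster DNS is not answering"`.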
CoreDNS is unreachable, so next check whether the CoreDNS pod or its node is unhealthy.
$ kubectl get pods -n kube-system -o wide |grep coredns
coredns-66bff467f8-9hjfd 1/1 Running 0 2d19h 100.87.166.173 server65 <none> <none>
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
huohua-test Ready <none> 9d v1.18.0 192.168.0.30 <none> CentOS Linux 7 (Core) 3.10.0-1062.9.1.el7.x86_64 docker://19.3.0
k8s-master Ready master 9d v1.18.0 192.168.0.31 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://19.3.0
server65 Ready <none> 5d18h v1.18.0 192.168.0.65 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://19.3.8
server88-new Ready <none> 6d v1.18.0 192.168.0.88 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://19.3.8
Checking the CoreDNS logs shows nothing abnormal either.
$ kubectl logs -f coredns-66bff467f8-9hjfd -n kube-system
.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.7
linux/amd64, go1.13.6, da7f65b
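Going directly to the host running CoreDNS, the pod's metrics port answers, so the pod itself is reachable. The host's IPVS rules, however, do not point at the pod's current IP. In other words, the IPVS rules were not updated automatically.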
$ ssh server65
$ curl -I 100.87.166.173:9153
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Mon, 15 Jun 2020 03:41:57 GMT
Content-Length: 19
$ ipvsadm -ln | grep -A3 53
TCP 10.96.0.10:53 rr
-> 100.87.166.169:53 Masq 1 0 0
TCP 10.96.0.10:9153 rr
-> 100.87.166.172:9153 Masq 1 0 0
TCP 10.98.146.64:8082 rr
-> 100.70.101.177:8082 Masq 1 0 0
TCP 10.98.252.71:80 rr
--
UDP 10.96.0.10:53 rr
-> 100.87.166.169:53 Masq 1 0 69
## Current pod IP:        100.87.166.173
## IP in the IPVS rules:  100.87.166.169
## The rules were not updated automatically
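The comparison above can be scripted. The following is a minimal sketch that pulls the real-server IPs for one virtual service out of `ipvsadm -ln` output and reports whether a given pod IP is missing. The function names `ipvs_backends` and `pod_missing_from_ipvs` are hypothetical helpers, not part of any tool, and the parsing assumes IPv4 addresses in the layout shown above.

```shell
ipvs_backends() {
  # $1 = protocol (TCP or UDP), $2 = virtual service as IP:port (IPv4 only).
  # Reads `ipvsadm -ln` output on stdin and prints one backend IP per line.
  awk -v proto="$1" -v vip="$2" '
    $1 == "TCP" || $1 == "UDP" { in_svc = ($1 == proto && $2 == vip); next }
    in_svc && $1 == "->"       { split($2, parts, ":"); print parts[1] }
  '
}

pod_missing_from_ipvs() {
  # $1 = current pod IP; stdin = backend IPs, one per line.
  # Succeeds (exit 0) when the pod IP is NOT among the backends, i.e. the rule is stale.
  ! grep -qxF -- "$1"
}
```

On the CoreDNS host this could be used as `ipvsadm -ln | ipvs_backends UDP 10.96.0.10:53 | pod_missing_from_ipvs 100.87.166.173 && echo "stale IPVS rule"`.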
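The kube-proxy logs contain a large number of parseIP Error messages. After looking up the related GitHub issue (k8s #89520), it turns out to be a bug in Kubernetes v1.18.0. Rolling kube-proxy back to an older release such as v1.17.4 resolves it (the command at the end of this post uses the v1.17.6 image).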
$ docker ps -a |grep kube-proxy
9dc2c99ede89 92f9a31ce92a "/usr/local/bin/kube…" 2 days ago Up 2 days k8s_kube-proxy_kube-proxy-8vkrb_kube-system_acdff2a0-4efd-447c-922d-364feb508e9e_1
0cd79c83423f k8s.gcr.io/pause:3.2 "/pause" 2 days ago Up 2 days k8s_POD_kube-proxy-8vkrb_kube-system_acdff2a0-4efd-447c-922d-364feb508e9e_1
61f229d2248e k8s.gcr.io/kube-proxy:v1.18.0 "/usr/local/bin/kube…" 2 days ago Up 2 days k8s_kube-proxy_kube-proxy-8vkrb_kube-system_acdff2a0-4efd-447c-922d-364feb508e9e_0
044d3e11dbb8 k8s.gcr.io/pause:3.2 "/pause" 2 days ago Up 2 days k8s_POD_kube-proxy-8vkrb_kube-system_acdff2a0-4efd-447c-922d-364feb508e9e_0
$ docker logs -f 61f
E0610 04:19:01.916711 1 proxier.go:1192] Failed to sync endpoint for service: 10.96.0.1:443/TCP, err: parseIP Error ip=[192 168 0 31 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:01.916789 1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[100 70 101 165 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:01.916808 1 proxier.go:1192] Failed to sync endpoint for service: 10.103.137.114:80/TCP, err: parseIP Error ip=[100 70 101 165 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:01.916881 1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[100 116 59 71 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:01.916899 1 proxier.go:1192] Failed to sync endpoint for service: 10.96.0.10:53/UDP, err: parseIP Error ip=[100 116 59 71 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:01.916982 1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[192 168 0 30 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:01.917000 1 proxier.go:1192] Failed to sync endpoint for service: 10.110.140.83:443/TCP, err: parseIP Error ip=[192 168 0 30 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:31.965080 1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[100 70 101 162 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:31.965114 1 proxier.go:1192] Failed to sync endpoint for service: 10.109.196.2:8082/TCP, err: parseIP Error ip=[100 70 101 162 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:31.965215 1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[100 70 101 163 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:31.965235 1 proxier.go:1192] Failed to sync endpoint for service: 10.99.176.38:8080/TCP, err: parseIP Error ip=[100 70 101 163 0 0 0 0 0 0 0 0 0 0 0 0]^C
E0610 04:19:31.965359 1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[100 74 166 76 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:31.965384 1 proxier.go:1192] Failed to sync endpoint for service: 10.107.12.86:80/TCP, err: parseIP Error ip=[100 74 166 76 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:31.965481 1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[100 70 101 157 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:31.965505 1 proxier.go:1192] Failed to sync endpoint for service: 10.110.119.128:8081/TCP, err: parseIP Error ip=[100 70 101 157 0 0 0 0 0 0 0 0 0 0 0 0]
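Each `ip=[...]` in these messages is a 16-byte slice whose first four bytes look like an ordinary IPv4 address (for example, `192 168 0 31` matches the k8s-master node IP 192.168.0.31 listed earlier). A small sketch for listing the affected backend addresses from a kube-proxy log follows. The function name `parseip_addrs` is made up for illustration, and it assumes the exact log format shown above.

```shell
parseip_addrs() {
  # Reads kube-proxy log lines on stdin.
  # For every "parseIP Error ip=[a b c d ...]" it prints a.b.c.d, de-duplicated.
  sed -n 's/.*parseIP Error ip=\[\([0-9]*\) \([0-9]*\) \([0-9]*\) \([0-9]*\) .*/\1.\2.\3.\4/p' |
    sort -u
}
```

For example: `docker logs 61f 2>&1 | parseip_addrs`.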
The fix: after switching the kube-proxy image, Service name resolution recovered immediately. Problem solved.
$ kubectl -n kube-system set image daemonset/kube-proxy *=registry.aliyuncs.com/k8sxio/kube-proxy:v1.17.6
$ dig @100.87.166.173 kubernetes.default.svc.cluster.local +short
10.96.0.1
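To confirm that the rollback really took effect, one option is to read the image back from the DaemonSet. This is only a sketch: the helper names `kube_proxy_image` and `kube_proxy_is_rolled_back` and the `KUBECTL` override variable are hypothetical, and the check only looks for the exact v1.18.0 tag that misbehaved in this post.

```shell
# kubectl command; override KUBECTL to use a wrapper or another binary (assumption).
KUBECTL=${KUBECTL:-kubectl}

kube_proxy_image() {
  # Print the image set on the first container of the kube-proxy DaemonSet template.
  $KUBECTL -n kube-system get daemonset kube-proxy \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
}

kube_proxy_is_rolled_back() {
  # Succeeds only when an image is set and its tag is not v1.18.0.
  case "$(kube_proxy_image)" in
    "")        return 1 ;;
    *:v1.18.0) return 1 ;;
    *)         return 0 ;;
  esac
}
```

Running `kubectl -n kube-system rollout status daemonset/kube-proxy` first, then `kube_proxy_is_rolled_back && echo "kube-proxy no longer on v1.18.0"`, checks that every node picked up the new image.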