Troubleshooting a Service name resolution failure on Kubernetes v1.18.0

Kubernetes
By Louis, published June 15, 2020


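After I installed a Kubernetes 1.18.0 cluster, a power outage forced a reboot, and afterwards in-cluster addresses such as Service names in other namespaces could no longer be resolved from inside the cluster. It was a strange one; this post simply records how I tracked the problem down.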

Discovering the problem

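I had installed Prometheus for monitoring, with Grafana for the dashboards, and the Grafana backend kept returning 502. Puzzled, I opened the data source management page: the Prometheus data never loaded, and clicking the Test button for http://prometheus-k8s.monitoring.svc:9090 always returned a 500 error. Only then did it occur to me that in-cluster address resolution was the problem.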

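Because I had set up an ingress for both Grafana and Prometheus, each one worked fine when accessed on its own, so I never suspected the data inside Grafana. It was only when I looked at the monitoring dashboards that I found the data source had been failing all along. Ugh.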

Approach

  • Check whether Service name resolution works inside the cluster
  • Narrow it down to the specific cause
$ curl -I 10.96.0.10:53
curl: (7) Failed connect to 10.96.0.10:53; No route to host
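
The curl probe above speaks HTTP to the DNS port, so it only shows whether a TCP connection can be made. A DNS query exercises the same path more directly. The following is a minimal sketch, assuming `dig` is installed on the node; the helper name `resolve_via` is made up for illustration, and 10.96.0.10 is the cluster DNS Service IP seen above.

```shell
# Hypothetical helper: resolve a name against one specific DNS server.
# $1 = DNS server IP (for example the cluster DNS Service IP)
# $2 = name to resolve
# Prints the answers and succeeds, or prints nothing and fails.
resolve_via() {
    # +short prints only the answers; +time/+tries keep the wait short.
    # dig reports failures as lines starting with ";", so drop those.
    answer=$(dig "@$1" "$2" +short +time=2 +tries=1 2>/dev/null | grep -v '^;' || true)
    [ -n "$answer" ] && printf '%s\n' "$answer"
}

# Example (run on a cluster node):
#   resolve_via 10.96.0.10 kubernetes.default.svc.cluster.local \
#       || echo "cluster DNS did not answer"
```

If the query fails through the Service IP but succeeds when pointed at the CoreDNS pod IP, the DNS server itself is healthy and the problem lies in the Service forwarding path.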

# On the machine where CoreDNS runs

$ cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 114.114.114.114

CoreDNS cannot be reached, so check whether the CoreDNS pod or its node is unhealthy.

$ kubectl get pods -n kube-system -o wide  |grep coredns
coredns-66bff467f8-9hjfd                   1/1     Running   0          2d19h   100.87.166.173   server65       <none>           <none>
$ kubectl get nodes -o wide
NAME           STATUS   ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
huohua-test    Ready    <none>   9d      v1.18.0   192.168.0.30   <none>        CentOS Linux 7 (Core)   3.10.0-1062.9.1.el7.x86_64   docker://19.3.0
k8s-master     Ready    master   9d      v1.18.0   192.168.0.31   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64       docker://19.3.0
server65       Ready    <none>   5d18h   v1.18.0   192.168.0.65   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64       docker://19.3.8
server88-new   Ready    <none>   6d      v1.18.0   192.168.0.88   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64       docker://19.3.8

Check the CoreDNS logs; nothing abnormal shows up there either.

$  kubectl  logs -f coredns-66bff467f8-9hjfd -n kube-system 
.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.7
linux/amd64, go1.13.6, da7f65b

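Next, go straight to the host where CoreDNS runs. The pod's metrics port is reachable from there, but the host's IPVS table does not point at the pod's current IP. In other words, the IPVS rules were not being updated automatically.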

$ ssh  server65 
$ curl -I 100.87.166.173:9153
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Mon, 15 Jun 2020 03:41:57 GMT
Content-Length: 19
$ ipvsadm -ln | grep -A3 53

TCP  10.96.0.10:53 rr
  -> 100.87.166.169:53            Masq    1      0          0         
TCP  10.96.0.10:9153 rr
  -> 100.87.166.172:9153          Masq    1      0          0         
TCP  10.98.146.64:8082 rr
  -> 100.70.101.177:8082          Masq    1      0          0         
TCP  10.98.252.71:80 rr
--
UDP  10.96.0.10:53 rr
  -> 100.87.166.169:53            Masq    1      0          69  
  
## Current pod IP:        100.87.166.173
## IP in the IPVS rules:  100.87.166.169
## The rules were not updated automatically
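
The comparison above was done by eye. As a rough sketch, the same check can be scripted by pulling the real-server entries for one virtual service out of the `ipvsadm -ln` output. The function names `ipvs_backends` and `ipvs_has_backend` are invented for this example, and the parsing assumes the output layout shown above.

```shell
# Hypothetical helper: list the real servers (ip:port) IPVS holds for one
# virtual service. Reads `ipvsadm -ln` output on stdin.
# $1 = protocol (TCP or UDP), $2 = virtual address as VIP:port
ipvs_backends() {
    awk -v proto="$1" -v vip="$2" '
        # A line starting with TCP/UDP opens a new virtual service section.
        $1 == "TCP" || $1 == "UDP" { in_svc = ($1 == proto && $2 == vip); next }
        # "-> ip:port ..." lines are the real servers of the current section.
        in_svc && $1 == "->" && $2 ~ /^[0-9]/ { print $2 }
    '
}

# Hypothetical helper: succeed only if the expected backend ($3, ip:port)
# is present for the given virtual service.
ipvs_has_backend() {
    ipvs_backends "$1" "$2" | grep -qxF "$3"
}

# Example (run as root on the node):
#   ipvsadm -ln | ipvs_has_backend UDP 10.96.0.10:53 100.87.166.173:53 \
#       || echo "IPVS rule for cluster DNS is stale"
```

The expected backend addresses can be read from the Service's endpoints, for example with `kubectl -n kube-system get endpoints kube-dns`, assuming the DNS Service carries the usual kubeadm name `kube-dns`.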

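Checking the kube-proxy logs turned up a large number of parseIP Error messages. After looking up the related GitHub issue (kubernetes #89520), it turned out to be a bug in Kubernetes v1.18.0. Rolling kube-proxy back to an older release such as v1.17.4 fixes it.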

$ docker ps -a |grep kube-proxy
9dc2c99ede89        92f9a31ce92a                                             "/usr/local/bin/kube…"   2 days ago           Up 2 days                                                                     k8s_kube-proxy_kube-proxy-8vkrb_kube-system_acdff2a0-4efd-447c-922d-364feb508e9e_1
0cd79c83423f        k8s.gcr.io/pause:3.2                                     "/pause"                 2 days ago           Up 2 days                                                                     k8s_POD_kube-proxy-8vkrb_kube-system_acdff2a0-4efd-447c-922d-364feb508e9e_1
61f229d2248e        k8s.gcr.io/kube-proxy:v1.18.0                  "/usr/local/bin/kube…"   2 days ago           Up 2 days                                                         k8s_kube-proxy_kube-proxy-8vkrb_kube-system_acdff2a0-4efd-447c-922d-364feb508e9e_0
044d3e11dbb8        k8s.gcr.io/pause:3.2                                     "/pause"                 2 days ago           Up 2 days                                                          k8s_POD_kube-proxy-8vkrb_kube-system_acdff2a0-4efd-447c-922d-364feb508e9e_0

$ docker logs -f  61f
E0610 04:19:01.916711       1 proxier.go:1192] Failed to sync endpoint for service: 10.96.0.1:443/TCP, err: parseIP Error ip=[192 168 0 31 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:01.916789       1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[100 70 101 165 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:01.916808       1 proxier.go:1192] Failed to sync endpoint for service: 10.103.137.114:80/TCP, err: parseIP Error ip=[100 70 101 165 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:01.916881       1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[100 116 59 71 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:01.916899       1 proxier.go:1192] Failed to sync endpoint for service: 10.96.0.10:53/UDP, err: parseIP Error ip=[100 116 59 71 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:01.916982       1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[192 168 0 30 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:01.917000       1 proxier.go:1192] Failed to sync endpoint for service: 10.110.140.83:443/TCP, err: parseIP Error ip=[192 168 0 30 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:31.965080       1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[100 70 101 162 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:31.965114       1 proxier.go:1192] Failed to sync endpoint for service: 10.109.196.2:8082/TCP, err: parseIP Error ip=[100 70 101 162 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:31.965215       1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[100 70 101 163 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:31.965235       1 proxier.go:1192] Failed to sync endpoint for service: 10.99.176.38:8080/TCP, err: parseIP Error ip=[100 70 101 163 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:31.965359       1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[100 74 166 76 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:31.965384       1 proxier.go:1192] Failed to sync endpoint for service: 10.107.12.86:80/TCP, err: parseIP Error ip=[100 74 166 76 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:31.965481       1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[100 70 101 157 0 0 0 0 0 0 0 0 0 0 0 0]
E0610 04:19:31.965505       1 proxier.go:1192] Failed to sync endpoint for service: 10.110.119.128:8081/TCP, err: parseIP Error ip=[100 70 101 157 0 0 0 0 0 0 0 0 0 0 0 0]
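
The addresses in these errors are printed as 16-byte lists. Judging from the lines above, the first four bytes appear to be the IPv4 address (192 168 0 31 matches the master's 192.168.0.31). Under that assumption, a small sketch like the following turns the log into a readable list of affected addresses; both function names are made up for this example.

```shell
# Hypothetical helper: convert one byte list such as
# "[192 168 0 31 0 0 0 0 0 0 0 0 0 0 0 0]" to dotted IPv4 form.
# Assumes the first four bytes carry the IPv4 address.
parse_ip_bytes() {
    printf '%s\n' "$1" | tr -d '[]' |
        awk '{ printf "%d.%d.%d.%d\n", $1, $2, $3, $4 }'
}

# Hypothetical helper: read kube-proxy log lines on stdin and print each
# distinct address that appears in a "parseIP Error ip=[...]" message.
failed_ips_from_log() {
    sed -n 's/.*parseIP Error ip=\[\([0-9 ]*\)].*/\1/p' |
        awk '{ printf "%d.%d.%d.%d\n", $1, $2, $3, $4 }' |
        sort -u
}

# Example:
#   docker logs 61f 2>&1 | failed_ips_from_log
```

Each "Failed to list IPVS destinations" line is followed by a "Failed to sync endpoint" line for the same address, so de-duplicating with `sort -u` keeps the list short.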

Recovery: after changing the image, Service name resolution came back immediately. Problem solved.

$ kubectl -n kube-system set image daemonset/kube-proxy '*=registry.aliyuncs.com/k8sxio/kube-proxy:v1.17.6'

$ dig @100.87.166.173 kubernetes.default.svc.cluster.local +short
10.96.0.1
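
To confirm that the image change actually reached every node, one option is to list the image each kube-proxy pod is running and flag any that differ. This is only a sketch: `pods_not_on_image` is a made-up helper, and the label selector `k8s-app=kube-proxy` assumes a kubeadm-style deployment.

```shell
# Hypothetical helper: read "pod<TAB>image" lines on stdin, print every pod
# whose image differs from $1, and fail if any such pod exists.
pods_not_on_image() {
    awk -v want="$1" '
        $2 != want { print $1; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}

# Example:
#   kubectl -n kube-system rollout status daemonset/kube-proxy
#   kubectl -n kube-system get pods -l k8s-app=kube-proxy \
#       -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}' |
#       pods_not_on_image registry.aliyuncs.com/k8sxio/kube-proxy:v1.17.6 \
#       || echo "some kube-proxy pods are still on another image"
```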
