This post examines how Kubernetes exposes services under the iptables proxy mode. The exposure methods below build on one another layer by layer: each method depends on the ones above it. Any cluster node can use any of these methods to reach a service (as long as the service exists).
iptables
NAT table entry points:
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
-A POSTROUTING -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
-A POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/24 -j RETURN
-A POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
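The `--mark 0x4000/0x4000` match in KUBE-POSTROUTING tests whether the SNAT bit (set earlier by KUBE-MARK-MASQ) is present in the packet mark; KUBE-FIREWALL later tests 0x8000 the same way. A minimal sketch of the bitmask test (the helper name is ours, the bit values come from the rules above):

```python
# iptables "-m mark --mark VALUE/MASK" matches when (mark & MASK) == (VALUE & MASK)
SNAT_BIT = 0x4000  # set by KUBE-MARK-MASQ, checked by KUBE-POSTROUTING
DROP_BIT = 0x8000  # set by KUBE-MARK-DROP, checked by KUBE-FIREWALL

def mark_matches(packet_mark: int, value: int, mask: int) -> bool:
    """Replicates the -m mark --mark value/mask test."""
    return (packet_mark & mask) == (value & mask)

# A packet marked for SNAT matches KUBE-POSTROUTING's test ...
assert mark_matches(0x4000, SNAT_BIT, SNAT_BIT)
# ... while an unmarked packet does not, so it is not masqueraded
assert not mark_matches(0x0, SNAT_BIT, SNAT_BIT)
# The two bits can coexist on one packet because they occupy different positions
assert mark_matches(SNAT_BIT | DROP_BIT, DROP_BIT, DROP_BIT)
```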
FILTER table:
-A INPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A INPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes externally-visible service portals" -j KUBE-EXTERNAL-SERVICES
-A INPUT -j KUBE-FIREWALL
-A FORWARD -m comment --comment "kubernetes forwarding rules" -j KUBE-FORWARD
-A FORWARD -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -j KUBE-FIREWALL
-A FORWARD -s 10.244.0.0/16 -j ACCEPT
-A FORWARD -d 10.244.0.0/16 -j ACCEPT
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
-A KUBE-FORWARD -s 10.244.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -d 10.244.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-SERVICES -d 10.100.18.91/32 -p tcp -m comment --comment "judge-system/judge-system-4-server-master-service: has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 10.109.222.26/32 -p tcp -m comment --comment "judge-system/judge-system-4-server-development-service: has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 10.109.183.27/32 -p tcp -m comment --comment "jhub/hub: has no endpoints" -m tcp --dport 8081 -j REJECT --reject-with icmp-port-unreachable
As shown, Services that have no Endpoints are REJECTed directly in the FILTER table, while the rest proceed to the subsequent chains for further NAT processing.
Pod IP
A Pod IP exists as soon as the Pod is created and behaves like a VM's IP: you can ping it and so on. It is implemented by the network plugin plus the routing table (we use flannel, which builds an overlay network with VXLAN) to forward traffic to the right container.
The IP is not stable, and there is no load balancing. Traffic reaching a Pod IP from outside the Pod network needs SNAT:
-A POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE
Cluster IP
A Cluster IP exists once the Service is created. It is implemented with iptables or IPVS and can only forward the TCP or UDP ports declared in the YAML; traffic to the Cluster IP is ultimately forwarded to some Pod IP.
With iptables, the rules look roughly like this: match the IP + port combination, then forward:
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.96.7.210/32 -p tcp -m comment --comment "redis/redis:client cluster IP" -m tcp --dport 6379 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.96.7.210/32 -p tcp -m comment --comment "redis/redis:client cluster IP" -m tcp --dport 6379 -j KUBE-SVC-X65RDOJUUA3VPRY4
-A KUBE-SVC-X65RDOJUUA3VPRY4 -m statistic --mode random --probability 0.16667000018 -j KUBE-SEP-MWI2C5QF6IXJQU63
-A KUBE-SVC-X65RDOJUUA3VPRY4 -m statistic --mode random --probability 0.20000000019 -j KUBE-SEP-FXRXESSRETWGI65E
-A KUBE-SVC-X65RDOJUUA3VPRY4 -m statistic --mode random --probability 0.25000000000 -j KUBE-SEP-66HFVIWNCINZQPZR
-A KUBE-SVC-X65RDOJUUA3VPRY4 -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-ZH23ZMDYUP2DO5GI
-A KUBE-SVC-X65RDOJUUA3VPRY4 -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-BAPRIFMT5YYOA3WI
-A KUBE-SVC-X65RDOJUUA3VPRY4 -j KUBE-SEP-ILU3WXFIDU2MOAD4
# The following DNATs the Cluster IP to a Pod IP
-A KUBE-SEP-BAPRIFMT5YYOA3WI -s 10.244.5.216/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-BAPRIFMT5YYOA3WI -p tcp -m tcp -j DNAT --to-destination 10.244.5.216:6379
# ... omitted
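The `--probability` values in the KUBE-SVC chain are how kube-proxy gets a uniform choice among N endpoints using sequential rules: the k-th rule (0-indexed) fires with probability 1/(N-k), conditioned on no earlier rule having fired, and the last rule is unconditional. A quick sanity check of the arithmetic for the six-endpoint chain above:

```python
from fractions import Fraction

# Sequential match probabilities kube-proxy emits for 6 endpoints:
# 1/6, 1/5, 1/4, 1/3, 1/2, then an unconditional final rule.
n = 6
rule_probs = [Fraction(1, n - k) for k in range(n - 1)] + [Fraction(1)]

selected = []
remaining = Fraction(1)  # probability that no earlier rule matched
for p in rule_probs:
    selected.append(remaining * p)  # overall chance this endpoint is picked
    remaining *= 1 - p

# Every endpoint ends up with the same 1/6 share of new connections
assert all(s == Fraction(1, 6) for s in selected)
```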
To expose a service to machines outside the cluster, use one of the methods below; the principle is still to install the corresponding iptables rules.
Node Port
A port is opened on every node, and traffic to that port is forwarded to the corresponding Service:
-A KUBE-NODEPORTS -p tcp -m comment --comment "rabbitmq/rabbitmq:http" -m tcp --dport 30376 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "rabbitmq/rabbitmq:http" -m tcp --dport 30376 -j KUBE-SVC-35VHXOLOORONIWYJ
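The node port in the rules above is declared on the Service itself. A minimal sketch of such a Service (the selector and internal port numbers are illustrative, not taken from the cluster above):

```yaml
# Illustrative NodePort Service; selector/port/targetPort are placeholders
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq
  namespace: rabbitmq
spec:
  type: NodePort
  selector:
    app: rabbitmq
  ports:
    - name: http
      port: 15672        # Cluster IP port (placeholder)
      targetPort: 15672  # container port (placeholder)
      nodePort: 30376    # opened on every node, as in the rules above
```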
External IP (Load Balancer)
A Service of type LoadBalancer can be assigned an External IP; this is also how cloud providers' cloud controllers work. On a self-hosted cluster, you can assign the External IP manually and arrange for packets to that IP to be routed to your nodes, or refer to this post to build your own load balancer controller.
The iptables rules are similar to those for a Cluster IP:
-A KUBE-SERVICES -d 192.168.15.8/32 -p tcp -m comment --comment "seaweedfs/filer:http loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-BRYNYVZ5Q2DDPJIY
-A KUBE-FW-BRYNYVZ5Q2DDPJIY -m comment --comment "seaweedfs/filer:http loadbalancer IP" -j KUBE-MARK-MASQ
-A KUBE-FW-BRYNYVZ5Q2DDPJIY -m comment --comment "seaweedfs/filer:http loadbalancer IP" -j KUBE-SVC-BRYNYVZ5Q2DDPJIY
-A KUBE-FW-BRYNYVZ5Q2DDPJIY -m comment --comment "seaweedfs/filer:http loadbalancer IP" -j KUBE-MARK-DROP
If the Service's externalTrafficPolicy is set to Local, no SNAT is performed, and if the node has no local Pod, forwarded traffic is dropped.
-A KUBE-SERVICES -d 192.168.15.1/32 -p tcp -m comment --comment "kube-system/traefik-load-balance-service:web loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-VZFKBL2WWL2EQZ6G
-A KUBE-FW-VZFKBL2WWL2EQZ6G -m comment --comment "kube-system/traefik-load-balance-service:web loadbalancer IP" -j KUBE-XLB-VZFKBL2WWL2EQZ6G
-A KUBE-FW-VZFKBL2WWL2EQZ6G -m comment --comment "kube-system/traefik-load-balance-service:web loadbalancer IP" -j KUBE-MARK-DROP
-A KUBE-XLB-OZONKQKEZE5YW3UK -m comment --comment "masquerade LOCAL traffic for kube-system/traefik-load-balance-service:https LB IP" -m addrtype --src-type LOCAL -j KUBE-MARK-MASQ
-A KUBE-XLB-OZONKQKEZE5YW3UK -m comment --comment "route LOCAL traffic for kube-system/traefik-load-balance-service:https LB IP to service chain" -m addrtype --src-type LOCAL -j KUBE-SVC-OZONKQKEZE5YW3UK
-A KUBE-XLB-OZONKQKEZE5YW3UK -s 10.244.0.0/16 -m comment --comment "Redirect pods trying to reach external loadbalancer VIP to clusterIP" -j KUBE-SVC-OZONKQKEZE5YW3UK
-A KUBE-XLB-OZONKQKEZE5YW3UK -m comment --comment "kube-system/traefik-load-balance-service:https has no local endpoints" -j KUBE-MARK-DROP
With a local Pod, traffic is not dropped:
-A KUBE-SERVICES -d 192.168.15.1/32 -p tcp -m comment --comment "kube-system/traefik-load-balance-service:web loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-VZFKBL2WWL2EQZ6G
-A KUBE-FW-VZFKBL2WWL2EQZ6G -m comment --comment "kube-system/traefik-load-balance-service:web loadbalancer IP" -j KUBE-XLB-VZFKBL2WWL2EQZ6G
-A KUBE-FW-VZFKBL2WWL2EQZ6G -m comment --comment "kube-system/traefik-load-balance-service:web loadbalancer IP" -j KUBE-MARK-DROP
-A KUBE-XLB-VZFKBL2WWL2EQZ6G -m comment --comment "masquerade LOCAL traffic for kube-system/traefik-load-balance-service:web LB IP" -m addrtype --src-type LOCAL -j KUBE-MARK-MASQ
-A KUBE-XLB-VZFKBL2WWL2EQZ6G -m comment --comment "route LOCAL traffic for kube-system/traefik-load-balance-service:web LB IP to service chain" -m addrtype --src-type LOCAL -j KUBE-SVC-VZFKBL2WWL2EQZ6G
-A KUBE-XLB-VZFKBL2WWL2EQZ6G -s 10.244.0.0/16 -m comment --comment "Redirect pods trying to reach external loadbalancer VIP to clusterIP" -j KUBE-SVC-VZFKBL2WWL2EQZ6G
-A KUBE-XLB-VZFKBL2WWL2EQZ6G -m comment --comment "Balancing rule 0 for kube-system/traefik-load-balance-service:web" -j KUBE-SEP-EUB52TCTNKDY6XP4
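The Local policy that produces the KUBE-XLB chains above is set on the Service spec. A minimal sketch (the selector and ports are illustrative placeholders, not from the cluster above):

```yaml
# Illustrative LoadBalancer Service; selector/ports are placeholders
apiVersion: v1
kind: Service
metadata:
  name: traefik-load-balance-service
  namespace: kube-system
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # skip SNAT; drop if no local endpoint
  selector:
    app: traefik
  ports:
    - name: web
      port: 80
      targetPort: 80
```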
Host Port
The port the Pod wants is claimed directly on the host, and rules are added to NAT its traffic to the Pod.
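hostPort is declared per container in the Pod spec; the CNI portmap plugin (or equivalent) then installs the DNAT rules. A minimal sketch with illustrative names:

```yaml
# Illustrative Pod; name/image/ports are placeholders
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: nginx
      image: nginx
      ports:
        - containerPort: 80
          hostPort: 8080   # bound directly on the node, NATed to the container
```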