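Preface
/02
Packet capture analysis
cilium config debug=true
: enables debug output.
cilium config MonitorAggregationLevel=None
: turns off event aggregation so that every event is reported.
The capture script is as follows ⬇️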
cilium monitor -j > cil.log &
cil=$!
sleep 5 # wait for cilium monitor to start
echo "cil run ok"
nsenter -n -t 4952 bash -c "ping -c 1 10.10.40.222" # ping from inside the container's network namespace
echo "$cil"
kill "$cil"
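Packet flow in the normal case
To show how the normal and abnormal cases differ, start with a packet that is delivered correctly. The full log is long, so reproduce it yourself if you want every event (on a node with heavy event traffic, some events may be lost). The key entries are listed below ⬇️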
{"cpu":"CPU 07:","type":"trace","mark":"0x0","ifindex":"0","state":"unknown","observationPoint":"from-endpoint","traceSummary":"\u003c- endpoint 1177","source":1177,"bytes":98,"srcLabel":50600,"dstLabel":0,"dstID":0,"summary":{"ethernet":"Ethernet\t{Contents=[..14..] Payload=[..86..] SrcMAC=5e:f0:f4:59:43:10 DstMAC=d2:be:ba:0b:21:63 EthernetType=IPv4 Length=0}","ipv4":"IPv4\t{Contents=[..20..] Payload=[..64..] Version=4 IHL=5 TOS=0 Length=84 Id=65482 Flags=DF FragOffset=0 TTL=64 Protocol=ICMPv4 Checksum=23459 SrcIP=171.0.1.83 DstIP=10.10.40.222 Options=[] Padding=[]}","icmpv4":"ICMPv4\t{Contents=[..8..] Payload=[..56..] TypeCode=EchoRequest Checksum=6019 Id=52535 Seq=1}","l2":{"src":"5e:f0:f4:59:43:10","dst":"d2:be:ba:0b:21:63"},"l3":{"src":"171.0.1.83","dst":"10.10.40.222"}}}
{"cpu":"CPU 07:","type":"trace","mark":"0x0","ifindex":"0","state":"new","observationPoint":"to-stack","traceSummary":"-\u003e stack","source":1177,"bytes":98,"srcLabel":50600,"dstLabel":2,"dstID":0,"summary":{"ethernet":"Ethernet\t{Contents=[..14..] Payload=[..86..] SrcMAC=5e:f0:f4:59:43:10 DstMAC=d2:be:ba:0b:21:63 EthernetType=IPv4 Length=0}","ipv4":"IPv4\t{Contents=[..20..] Payload=[..64..] Version=4 IHL=5 TOS=0 Length=84 Id=65482 Flags=DF FragOffset=0 TTL=63 Protocol=ICMPv4 Checksum=23715 SrcIP=171.0.1.83 DstIP=10.10.40.222 Options=[] Padding=[]}","icmpv4":"ICMPv4\t{Contents=[..8..] Payload=[..56..] TypeCode=EchoRequest Checksum=6019 Id=52535 Seq=1}","l2":{"src":"5e:f0:f4:59:43:10","dst":"d2:be:ba:0b:21:63"},"l3":{"src":"171.0.1.83","dst":"10.10.40.222"}}}
{"cpu":"CPU 07:","type":"capture","mark":"0x0","message":"Delivery to ifindex 0","prefix":"-\u003e 0","source":1177,"bytes":98,"summary":"171.0.1.83 -\u003e 10.10.40.222 EchoRequest"}
{"cpu":"CPU 02:","type":"trace","mark":"0x0","ifindex":"0","state":"unknown","observationPoint":"to-network","traceSummary":"-\u003e network","source":125,"bytes":98,"srcLabel":0,"dstLabel":0,"dstID":0,"summary":{"ethernet":"Ethernet\t{Contents=[..14..] Payload=[..86..] SrcMAC=52:54:00:ab:ac:be DstMAC=0c:c4:7a:68:fa:ee EthernetType=IPv4 Length=0}","ipv4":"IPv4\t{Contents=[..20..] Payload=[..64..] Version=4 IHL=5 TOS=0 Length=84 Id=65482 Flags=DF FragOffset=0 TTL=62 Protocol=ICMPv4 Checksum=55265 SrcIP=10.10.40.11 DstIP=10.10.40.222 Options=[] Padding=[]}","icmpv4":"ICMPv4\t{Contents=[..8..] Payload=[..56..] TypeCode=EchoRequest Checksum=6019 Id=52535 Seq=1}","l2":{"src":"52:54:00:ab:ac:be","dst":"0c:c4:7a:68:fa:ee"},"l3":{"src":"10.10.40.11","dst":"10.10.40.222"}}}
# Reply
{"cpu":"CPU 06:","type":"trace","mark":"0xc1d67621","ifindex":"eth0","state":"unknown","observationPoint":"from-network","traceSummary":"\u003c- network","source":125,"bytes":98,"srcLabel":0,"dstLabel":0,"dstID":0,"summary":{"ethernet":"Ethernet\t{Contents=[..14..] Payload=[..86..] SrcMAC=0c:c4:7a:68:fa:ee DstMAC=52:54:00:ab:ac:be EthernetType=IPv4 Length=0}","ipv4":"IPv4\t{Contents=[..20..] Payload=[..64..] Version=4 IHL=5 TOS=0 Length=84 Id=31538 Flags= FragOffset=0 TTL=64 Protocol=ICMPv4 Checksum=39546 SrcIP=10.10.40.222 DstIP=10.10.40.11 Options=[] Padding=[]}","icmpv4":"ICMPv4\t{Contents=[..8..] Payload=[..56..] TypeCode=EchoReply Checksum=8067 Id=52535 Seq=1}","l2":{"src":"0c:c4:7a:68:fa:ee","dst":"52:54:00:ab:ac:be"},"l3":{"src":"10.10.40.222","dst":"10.10.40.11"}}}
{"cpu":"CPU 06:","type":"trace","mark":"0xc1d67621","ifindex":"eth0","state":"unknown","observationPoint":"from-host","traceSummary":"\u003c- host","source":125,"bytes":98,"srcLabel":2,"dstLabel":0,"dstID":0,"summary":{"ethernet":"Ethernet\t{Contents=[..14..] Payload=[..86..] SrcMAC=e2:09:4c:b7:53:8a DstMAC=e2:09:4c:b7:53:8a EthernetType=IPv4 Length=0}","ipv4":"IPv4\t{Contents=[..20..] Payload=[..64..] Version=4 IHL=5 TOS=0 Length=84 Id=6169 Flags= FragOffset=0 TTL=63 Protocol=ICMPv4 Checksum=33877 SrcIP=10.10.40.222 DstIP=171.0.1.83 Options=[] Padding=[]}","icmpv4":"ICMPv4\t{Contents=[..8..] Payload=[..56..] TypeCode=EchoReply Checksum=6721 Id=10339 Seq=1}","l2":{"src":"e2:09:4c:b7:53:8a","dst":"e2:09:4c:b7:53:8a"},"l3":{"src":"10.10.40.222","dst":"171.0.1.83"}}}
{"cpu":"CPU 06:","type":"debug","message":"Attempting local delivery for container id 1177 from seclabel 2"}
{"cpu":"CPU 06:","type":"trace","mark":"0xc1d67621","ifindex":"lxc27709feb57e9","state":"reply","observationPoint":"to-endpoint","traceSummary":"-\u003e endpoint 1177","source":1177,"bytes":98,"srcLabel":2,"dstLabel":50600,"dstID":1177,"summary":{"ethernet":"Ethernet\t{Contents=[..14..] Payload=[..86..] SrcMAC=d2:be:ba:0b:21:63 DstMAC=5e:f0:f4:59:43:10 EthernetType=IPv4 Length=0}","ipv4":"IPv4\t{Contents=[..20..] Payload=[..64..] Version=4 IHL=5 TOS=0 Length=84 Id=31538 Flags= FragOffset=0 TTL=62 Protocol=ICMPv4 Checksum=8764 SrcIP=10.10.40.222 DstIP=171.0.1.83 Options=[] Padding=[]}","icmpv4":"ICMPv4\t{Contents=[..8..] Payload=[..56..] TypeCode=EchoReply Checksum=8067 Id=52535 Seq=1}","l2":{"src":"d2:be:ba:0b:21:63","dst":"5e:f0:f4:59:43:10"},"l3":{"src":"10.10.40.222","dst":"171.0.1.83"}}}
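Reading these JSON lines by eye is tedious, so a small helper can reduce each trace event to the fields that matter for following the packet: observation point, source, destination and TTL. The sketch below is only an illustration. The `summarize` function and the abbreviated `SAMPLE` events are hypothetical and assume the field layout seen in the log above; they are not part of Cilium.

```python
import json


def summarize(line):
    """Return (observation point, src IP, dst IP, TTL) for a trace event, else None."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None  # separator comments or truncated lines
    if not isinstance(event, dict) or event.get("type") != "trace":
        return None  # debug and capture events carry no parsed headers
    summary = event.get("summary")
    if not isinstance(summary, dict):
        return None
    l3 = summary.get("l3", {})
    ttl = None
    # The IPv4 header is a flat "Key=Value" string, so pick TTL out by prefix.
    for field in summary.get("ipv4", "").split():
        if field.startswith("TTL="):
            ttl = int(field[len("TTL="):])
    return event.get("observationPoint"), l3.get("src"), l3.get("dst"), ttl


# Abbreviated stand-ins for three events from the log above (assumed shape).
SAMPLE = [
    json.dumps({"type": "trace", "observationPoint": "from-endpoint",
                "summary": {"ipv4": "IPv4\t{Version=4 TTL=64 Protocol=ICMPv4}",
                            "l3": {"src": "171.0.1.83", "dst": "10.10.40.222"}}}),
    json.dumps({"type": "trace", "observationPoint": "to-network",
                "summary": {"ipv4": "IPv4\t{Version=4 TTL=62 Protocol=ICMPv4}",
                            "l3": {"src": "10.10.40.11", "dst": "10.10.40.222"}}}),
    json.dumps({"type": "debug", "message": "Attempting local delivery"}),
]

for row in filter(None, map(summarize, SAMPLE)):
    print(*row, sep="\t")
```

Feeding the lines of cil.log through `summarize` instead of `SAMPLE` gives the same kind of table for a real capture. The change of source address from 171.0.1.83 to 10.10.40.11 between from-endpoint and to-network then stands out immediately.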
Arranging the log into a flow chart gives the following:

Request path ⬆️
The host routing table is as follows ⬇️
[root@c7-2 ~]# ip r
default via 10.10.40.254 dev eth0 proto static metric 100
10.10.40.0/24 dev eth0 proto kernel scope link src 10.10.40.11 metric 100
171.0.0.0/24 via 171.0.1.150 dev cilium_host src 171.0.1.150 mtu 1450
171.0.1.0/24 via 171.0.1.150 dev cilium_host src 171.0.1.150
171.0.1.150 dev cilium_host scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
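When the ping packet leaves the lxc interface, its destination is 10.10.40.222. That address falls inside the directly connected route 10.10.40.0/24 dev eth0 (a more specific match than the default route), so the packet leaves through eth0. This triggers the eBPF program attached to eth0.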
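The route selection for the request can be checked with a small longest-prefix-match sketch over the main table shown above. This is a simplified model, not how the kernel implements its lookup. The `MAIN_TABLE` list and the `lookup` helper are hypothetical names, and metrics, scopes and the linkdown flag are ignored.

```python
import ipaddress

# (prefix, device) pairs transcribed from the `ip r` output above.
MAIN_TABLE = [
    ("0.0.0.0/0", "eth0"),            # default via 10.10.40.254
    ("10.10.40.0/24", "eth0"),
    ("171.0.0.0/24", "cilium_host"),
    ("171.0.1.0/24", "cilium_host"),
    ("171.0.1.150/32", "cilium_host"),
    ("172.17.0.0/16", "docker0"),
]


def lookup(table, dst):
    """Return (prefix, device) of the most specific route covering dst, or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, dev in table:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, dev)
    return (str(best[0]), best[1]) if best else None


print(lookup(MAIN_TABLE, "10.10.40.222"))  # request leaving the pod
print(lookup(MAIN_TABLE, "171.0.1.83"))    # traffic addressed to the pod
```

In this model the request matches 10.10.40.0/24 on eth0, and a packet addressed to the pod matches 171.0.1.0/24 on cilium_host. The latter is the path a healthy reply is expected to take.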

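Reply path ⬆️
Notes
On the reply path the packet is handed straight from cilium_host to the lxc interface inside the kernel, without going back up through the upper layers. This is why all container networks are routed to cilium_host, and also why only an ingress program is attached to the lxc interface.
to-endpoint
The function behind this event is not a program attached to the lxc interface; it is reached through an eBPF tail call. It handles the ingress policy logic, so if a network policy applies, the packet is rejected at this stage.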
/03
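Packet flow in the abnormal case
Since the host does receive the reply (confirmed with tcpdump), the focus is on the reply path. Using cilium monitor in the same way, the flow chart built from the log is as follows ⬇️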

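Comparing the two flow charts: in the normal case, after from-network the reply enters the cilium_host processing path (from-host). In the abnormal case, after from-network it enters the eth0 processing path instead.
That makes the problem fairly clear: host routing is at fault.
ip r show revealed nothing obviously wrong, but the routing rules did show an anomaly ⬇️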
ip rule show
9: from all fwmark 0x200/0xf00 lookup 2004
10: from all fwmark 0xa00/0xf00 lookup 2005
100: from all lookup local
32760: from 10.10.40.222 lookup 100
32766: from all lookup main
32767: from all lookup default
This rule corresponds to a policy route ⬇️
ip rule show table 100
32760: from 10.10.40.222 lookup 100
ip r show table 100
default via 10.10.40.254 dev eth0
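The policy route says that packets with source IP 10.10.40.222 use routing table 100, and table 100 sends everything to the gateway through eth0.
Although the ping reply is a response packet, it happens to satisfy the source-IP condition, so it is forwarded out of eth0 and never reaches cilium_host. The cilium_host routes live in the main table, which is consulted at priority 32766, after the policy rule at 32760.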
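The rule evaluation described here can be modelled in a few lines: rules are walked in ascending priority, and the first rule whose selector matches and whose table contains a covering route decides the output device. The sketch below is a simplified model under stated assumptions, not kernel behaviour in full. The contents of the local table are assumed to be the host's own addresses, and the two fwmark rules are left out on the assumption that the packet mark does not match them. `TABLES`, `RULES`, `longest_match` and `resolve` are hypothetical names.

```python
import ipaddress

# Routing tables as (prefix, device). "100" and "main" come from the output
# above; the "local" entries are an assumption (the host's own addresses).
TABLES = {
    "local": [("10.10.40.11/32", "local"), ("171.0.1.150/32", "local")],
    "100": [("0.0.0.0/0", "eth0")],
    "main": [("0.0.0.0/0", "eth0"), ("10.10.40.0/24", "eth0"),
             ("171.0.0.0/24", "cilium_host"), ("171.0.1.0/24", "cilium_host")],
    "default": [],
}

# Rules as (pref, from-prefix or None for "all", table). The two fwmark rules
# are omitted on the assumption that the packet mark does not match them.
RULES = [
    (100, None, "local"),
    (32760, "10.10.40.222/32", "100"),
    (32766, None, "main"),
    (32767, None, "default"),
]


def longest_match(table, dst):
    """Return the device of the most specific route covering dst, or None."""
    hits = [(ipaddress.ip_network(p).prefixlen, dev) for p, dev in table
            if dst in ipaddress.ip_network(p)]
    return max(hits)[1] if hits else None


def resolve(src, dst):
    """Walk the rules by ascending pref; the first table with a route wins."""
    src, dst = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    for pref, selector, table in sorted(RULES, key=lambda r: r[0]):
        if selector is not None and src not in ipaddress.ip_network(selector):
            continue
        dev = longest_match(TABLES[table], dst)
        if dev is not None:
            return pref, table, dev
    return None


print(resolve("10.10.40.222", "171.0.1.83"))  # the reply addressed to the pod
print(resolve("10.10.40.50", "171.0.1.83"))   # hypothetical other source
```

In this model the reply from 10.10.40.222 is captured by rule 32760 and leaves through eth0, while the same destination reached from any other source falls through to the main table and goes to cilium_host.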
To verify this, I added a higher-priority policy rule on top of the existing rules ⬇️
ip rule add from 10.10.40.222 to 171.0.0.0/24 table main pref 20
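The rule gets priority 20 (the pref value in the command), so it is evaluated before policy rule 32760. Because it is restricted by both source IP and destination range, the original policy route is unaffected. The to prefix has to cover the pod addresses: the pod in this test, 171.0.1.83, lies in 171.0.1.0/24, so use that prefix or a wider one such as 171.0.0.0/16 if 171.0.0.0/24 does not cover your pods. After the rule was added, access to that IP from inside the container worked normally.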
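Whether the new rule actually catches the reply depends on both selectors, so it is worth checking the destination prefix against the pod IP seen in the capture. The snippet below is a minimal sketch of that check. `rule_matches` is a hypothetical helper, and the candidate prefixes other than 171.0.0.0/24 are assumptions added for comparison.

```python
import ipaddress


def rule_matches(src, dst, from_prefix, to_prefix):
    """True if a packet satisfies both the from and to selectors of a rule."""
    return (ipaddress.ip_address(src) in ipaddress.ip_network(from_prefix)
            and ipaddress.ip_address(dst) in ipaddress.ip_network(to_prefix))


# Reply seen in the capture: 10.10.40.222 -> 171.0.1.83.
for to_prefix in ("171.0.0.0/24", "171.0.1.0/24", "171.0.0.0/16"):
    hit = rule_matches("10.10.40.222", "171.0.1.83", "10.10.40.222/32", to_prefix)
    print(f"to {to_prefix}: {'matches' if hit else 'does not match'}")
```

A prefix that does not contain the pod address leaves the reply to rule 32760 again, so the `to` range should be chosen to cover every pod CIDR routed through cilium_host on the node.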
/04
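Outlook
This analysis covers ICMP packets, which are simpler to follow than TCP. The eBPF programs that Cilium attaches at the socket layer are not involved, so the model is comparatively simple.
I used to have a simplistic view of networking and assumed that a reply goes back the way the request came, but the system does not work that way. That is also why rp_filter exists. The policy route was added so that replies to requests arriving on eth0 for 10.10.40.222 go back out through eth0, but it unexpectedly affected the container network. Whether with multiple NICs or multiple virtual interfaces, policy routing has its uses, and a later article will look at it in depth.
The steps to add the policy route are as follows; feel free to add it and verify ⬇️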
ip r add 0/0 via 10.10.40.254 table 100
ip rule add from 10.10.40.222 table 100 pref 32760
Author of this issue | 沃趣科技 Product R&D Department
Copyrighted work. Reproduction without permission is prohibited.