
MetalLB in layer 2 mode

零壹陋室 2021-09-23

In my previous article, "Installing and Configuring an Ingress Controller and MetalLB Load Balancing on a Non-Cloud K8S Cluster", I mentioned that I would translate MetalLB's official documentation. My translation adds my own understanding of the technology, so I keep the original English alongside it for readers' reference.


This article is translated from the official MetalLB documentation at https://metallb.universe.tf/concepts/layer2/; all rights remain with the original.


Some passages are my own commentary; these are always prefixed with "kursk's note:".

In layer 2 mode, one node assumes the responsibility of advertising a service to the local network. From the network’s perspective, it simply looks like that machine has multiple IP addresses assigned to its network interface.


kursk's note: judging from the context, "layer 2" here does refer to layer 2 of the OSI network model; below I will simply call it "layer 2 mode".


Under the hood, MetalLB responds to ARP requests for IPv4 services, and NDP requests for IPv6.


kursk's note: ARP resolves a layer 2 MAC address from a layer 3 IPv4 address. NDP is IPv6's Neighbor Discovery Protocol: connect two devices with nothing but a network cable and, without configuring any parameters, they can learn each other's MAC addresses via NDP and reach each other at layer 2.
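To make the ARP side concrete, here is a minimal Go sketch of answering "who has this service IP?" requests with the gopacket library. This is not MetalLB's actual code; the interface name eth0 and the service IP 192.168.1.240 are placeholder assumptions.

```go
package main

import (
	"bytes"
	"log"
	"net"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcap"
)

func main() {
	serviceIP := net.ParseIP("192.168.1.240").To4() // hypothetical service IP we answer for

	ifi, err := net.InterfaceByName("eth0") // assumed interface name
	if err != nil {
		log.Fatal(err)
	}
	handle, err := pcap.OpenLive("eth0", 128, false, pcap.BlockForever)
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()
	if err := handle.SetBPFFilter("arp"); err != nil { // capture only ARP traffic
		log.Fatal(err)
	}

	for pkt := range gopacket.NewPacketSource(handle, handle.LinkType()).Packets() {
		layer := pkt.Layer(layers.LayerTypeARP)
		if layer == nil {
			continue
		}
		req := layer.(*layers.ARP)
		// Only answer "who-has <serviceIP>" requests.
		if req.Operation != layers.ARPRequest || !bytes.Equal(req.DstProtAddress, serviceIP) {
			continue
		}
		eth := layers.Ethernet{
			SrcMAC:       ifi.HardwareAddr,
			DstMAC:       net.HardwareAddr(req.SourceHwAddress),
			EthernetType: layers.EthernetTypeARP,
		}
		reply := layers.ARP{
			AddrType:          layers.LinkTypeEthernet,
			Protocol:          layers.EthernetTypeIPv4,
			HwAddressSize:     6,
			ProtAddressSize:   4,
			Operation:         layers.ARPReply,
			SourceHwAddress:   ifi.HardwareAddr, // "the service IP lives at our MAC"
			SourceProtAddress: serviceIP,
			DstHwAddress:      req.SourceHwAddress,
			DstProtAddress:    req.SourceProtAddress,
		}
		buf := gopacket.NewSerializeBuffer()
		if err := gopacket.SerializeLayers(buf, gopacket.SerializeOptions{FixLengths: true}, &eth, &reply); err != nil {
			log.Fatal(err)
		}
		if err := handle.WritePacketData(buf.Bytes()); err != nil {
			log.Fatal(err)
		}
	}
}
```

The essential trick is in the reply's sender fields: they pair the service IP with this node's MAC address, which is exactly why, from the network's perspective, the node appears to own multiple IP addresses.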


The major advantage of the layer 2 mode is its universality: it will work on any Ethernet network, with no special hardware required, not even fancy routers.



Load-balancing behavior

In layer 2 mode, all traffic for a service IP goes to one node. From there, kube-proxy spreads the traffic to all the service’s pods.


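On the Kubernetes side, the service IP in question is the external IP of a Service of type LoadBalancer. As a rough client-go sketch (the name web, the app=web selector, the port numbers, and the kubeconfig path are all assumptions), creating such a service looks like this; once MetalLB assigns it an address, traffic to that IP follows the path described above.

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load credentials from a local kubeconfig (path is an assumption).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "web"},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeLoadBalancer,  // MetalLB assigns the external IP
			Selector: map[string]string{"app": "web"}, // pods kube-proxy spreads traffic to
			Ports: []corev1.ServicePort{{
				Port:       80,                   // the port clients hit on the service IP
				TargetPort: intstr.FromInt(8080), // the pods' container port
			}},
		},
	}
	created, err := client.CoreV1().Services("default").Create(context.TODO(), svc, metav1.CreateOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("created service %s; MetalLB will populate its status.loadBalancer", created.Name)
}
```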


In that sense, layer 2 does not implement a load balancer. Rather, it implements a failover mechanism so that a different node can take over should the current leader node fail for some reason.




If the leader node fails for some reason, failover is automatic: the failed node is detected using memberlist, at which point new nodes take over ownership of the IP addresses from the failed node.




kursk's note: in my previous article, "Installing and Configuring an Ingress Controller and MetalLB Load Balancing on a Non-Cloud K8S Cluster", binding the elastic network interface to a specific instance was wrong. As the text above explains, the service IP belongs to the leader node, and the leader node can change, so the elastic NIC should not be bound to one instance; the IP should instead be obtained via a DHCP service. However, that article's environment was provided by Alibaba Cloud, which offers no DHCP service there. I tried leaving the IP unbound, and pinging it from other machines failed, so I had to fall back on binding the IP address to an instance.


kursk's note: memberlist (https://github.com/hashicorp/memberlist) is a Go library that manages cluster membership using a gossip protocol and discovers failed nodes among the cluster's members.
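For a feel of what that library does, here is a minimal sketch of joining a cluster with memberlist and listing the live members; the node name and join address are hypothetical.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/memberlist"
)

func main() {
	cfg := memberlist.DefaultLANConfig()
	cfg.Name = "node-2" // must be unique within the cluster (hypothetical)

	list, err := memberlist.Create(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Join via any node that is already a member (hypothetical address).
	if _, err := list.Join([]string{"192.168.1.10"}); err != nil {
		log.Fatal(err)
	}

	// Gossip keeps this view current; a node that fails soon drops out of
	// it, which is the signal MetalLB uses to move service IPs elsewhere.
	for _, m := range list.Members() {
		fmt.Printf("%s %s\n", m.Name, m.Addr)
	}
}
```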

Limitations

Layer 2 mode has two main limitations you should be aware of: single-node bottlenecking, and potentially slow failover.




As explained above, in layer2 mode a single leader-elected node receives all traffic for a service IP. This means that your service’s ingress bandwidth is limited to the bandwidth of a single node. This is a fundamental limitation of using ARP and NDP to steer traffic.




In the current implementation, failover between nodes depends on cooperation from the clients. When a failover occurs, MetalLB sends a number of gratuitous layer 2 packets (a bit of a misnomer - it should really be called “unsolicited layer 2 packets”) to notify clients that the MAC address associated with the service IP has changed.


kursk's note: shouldn't these strictly be called frames rather than packets? "Packet" usually refers to a layer 3 or layer 4 unit, while what is sent here travels in layer 2 Ethernet frames.


kursk's note: no wonder ARP is used here. On failover, MetalLB broadcasts gratuitous ARP messages announcing the new MAC address for the service IP, so clients can update their ARP caches with the new mapping.
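The gratuitous ARP itself is simple: an ARP reply broadcast to ff:ff:ff:ff:ff:ff whose sender IP and target IP are both the service IP, so every host on the segment overwrites its cached MAC for that IP. A rough Go sketch with gopacket (again not MetalLB's actual code; eth0 and 192.168.1.240 are assumptions):

```go
package main

import (
	"log"
	"net"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcap"
)

func main() {
	serviceIP := net.ParseIP("192.168.1.240").To4() // hypothetical service IP
	broadcast := net.HardwareAddr{0xff, 0xff, 0xff, 0xff, 0xff, 0xff}

	ifi, err := net.InterfaceByName("eth0") // assumed interface of the new leader
	if err != nil {
		log.Fatal(err)
	}
	handle, err := pcap.OpenLive("eth0", 128, false, pcap.BlockForever)
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	eth := layers.Ethernet{
		SrcMAC:       ifi.HardwareAddr,
		DstMAC:       broadcast, // every host on the segment should see it
		EthernetType: layers.EthernetTypeARP,
	}
	// Gratuitous ARP: sender IP == target IP == service IP, so receivers
	// replace any cached MAC for that IP with the new leader's MAC.
	announce := layers.ARP{
		AddrType:          layers.LinkTypeEthernet,
		Protocol:          layers.EthernetTypeIPv4,
		HwAddressSize:     6,
		ProtAddressSize:   4,
		Operation:         layers.ARPReply,
		SourceHwAddress:   ifi.HardwareAddr,
		SourceProtAddress: serviceIP,
		DstHwAddress:      broadcast,
		DstProtAddress:    serviceIP,
	}
	buf := gopacket.NewSerializeBuffer()
	if err := gopacket.SerializeLayers(buf, gopacket.SerializeOptions{FixLengths: true}, &eth, &announce); err != nil {
		log.Fatal(err)
	}
	if err := handle.WritePacketData(buf.Bytes()); err != nil {
		log.Fatal(err)
	}
}
```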


Most operating systems handle “gratuitous” packets correctly, and update their neighbor caches promptly. In that case, failover happens within a few seconds. However, some systems either don’t implement gratuitous handling at all, or have buggy implementations that delay the cache update.




All modern versions of major OSes (Windows, Mac, Linux) implement layer 2 failover correctly, so the only situation where issues may happen is with older or less common OSes.




To minimize the impact of planned failover on buggy clients, you should keep the old leader node up for a couple of minutes after flipping leadership, so that it can continue forwarding traffic for old clients until their caches refresh.




kursk's note: this feels like MetalLB asking the impossible. A failure has already happened, and the node may even have shut down, so how am I supposed to keep the old leader running for a few more minutes?


During an unplanned failover, the service IPs will be unreachable until the buggy clients refresh their cache entries.




kursk's note: in other words, the paragraph above is about a planned failover, in which the old leader node is still alive and really can be kept around for a few minutes. Fine, I'll stop quibbling.


If you encounter a situation where layer 2 mode failover is slow (more than about 10s), please file a bug! We can help you investigate and determine if the issue is with the client, or a bug in MetalLB.



Comparison to Keepalived

MetalLB’s layer2 mode has a lot of similarities to Keepalived, so if you’re familiar with Keepalived, this should all sound fairly familiar. However, there are also a few differences worth mentioning. If you aren’t familiar with Keepalived, you can skip this section.




kursk's note: Keepalived is a Linux clustering component; its documentation is at https://www.keepalived.org/documentation.html.


Keepalived uses the Virtual Router Redundancy Protocol (VRRP). Instances of Keepalived continuously exchange VRRP messages with each other, both to select a leader and to notice when that leader goes away.




MetalLB on the other hand relies on memberlist to know when a node in the cluster is no longer reachable and the service IPs from that node should be moved elsewhere.




Keepalived and MetalLB “look” the same from the client’s perspective: the service IP address seems to migrate from one machine to another when a failover occurs, and the rest of the time it just looks like machines have more than one IP address.




Because it doesn’t use VRRP, MetalLB isn’t subject to some of the limitations of that protocol. For example, the VRRP limit of 255 load balancers per network doesn’t exist in MetalLB. You can have as many load-balanced IPs as you want, as long as there are free IPs in your network. MetalLB also requires less configuration than VRRP–for example, there are no Virtual Router IDs.




On the flip side, because MetalLB relies on memberlist for cluster membership information, it cannot interoperate with third-party VRRP-aware routers and infrastructure. This is working as intended: MetalLB is specifically designed to provide load balancing and failover within a Kubernetes cluster, and in that scenario interoperability with third-party LB software is out of scope.



