Hybrid k3s #3: Pods couldn't talk to each other — flannel VXLAN and vmnet


Hybrid k3s — full architecture

0. About this series

This series is a record — written one piece at a time — of how I actually built the homelab in the diagram above, the one that’s still running as I write this.

What began as a toy project from a simple “could this even work?” turned, through satisfying performance and endless tearing-down-and-rebuilding, into a genuine toy that takes the edge off the stress built up at work. It isn’t a resource-rich cluster, but it’s been more than enough to get a real taste of Kubernetes, and it keeps handing me the next thing I want to try.

  • 6 nodes — 2 Lightsail servers (control plane + etcd) in the cloud (AWS Tokyo) + 4 Lima VM agents on a home (Sapporo) iMac
  • 19 vCPU / 61 GiB total, 49 namespaces, 248 pods (150 running)
  • Deployed with ArgoCD, auth via Keycloak OIDC, with CloudNativePG, Vault, CrowdSec, Prometheus/Grafana and more running on top

In #1 I stood up two cloud control-plane nodes, and in #2 I welcomed the home iMac as four Lima VM agents — and all six nodes went Ready. This article is about what came next: the nodes were all connected, yet the Pods themselves couldn’t talk across nodes. I peer into flannel, follow where the packets actually go, and end up binding the VMs on the same iMac directly with vmnet.

1. Six nodes Ready, yet the Pods were strangers

The picture I left off on last time looked clean. Running kubectl get nodes, the two Tokyo servers and four Sapporo Lima agents were all Ready, with a Tailscale 100.x address in INTERNAL-IP.

$ kubectl get nodes -o wide
NAME               STATUS   ROLES                AGE    VERSION        INTERNAL-IP     OS-IMAGE             CONTAINER-RUNTIME
ip-172-26-2-70…    Ready    control-plane,etcd   143d   v1.34.3+k3s1   100.99.x.x      Amazon Linux 2023   containerd://2.1.5-k3s1
ip-172-26-3-146…   Ready    control-plane,etcd   143d   v1.34.3+k3s1   100.71.x.x      Amazon Linux 2023   containerd://2.1.5-k3s1
lima-k3s-agent     Ready    <none>               143d   v1.34.3+k3s1   100.84.x.x      Ubuntu 24.04.3 LTS  containerd://2.1.5-k3s1
lima-k3s-agent-2   Ready    <none>               143d   v1.34.3+k3s1   100.98.x.x      Ubuntu 24.04.3 LTS  containerd://2.1.5-k3s1
lima-k3s-agent-3   Ready    <none>               143d   v1.34.3+k3s1   100.117.x.x     Ubuntu 24.04.3 LTS  containerd://2.1.5-k3s1
lima-k3s-agent-4   Ready    <none>               143d   v1.34.3+k3s1   100.90.x.x      Ubuntu 25.10        containerd://2.1.5-k3s1

I thought I was done. So I eagerly started piling Pods on — and almost immediately hit a strange wall. Pods on the same node talked just fine, but Pods on different nodes couldn’t reach each other. Services timed out in odd places, and some Pods couldn’t even resolve DNS.

At first I thought, “every node is Ready — so why?” That was a misread. Ready and “the Pod network works” are at two different layers.

  • Node Ready — the control-plane path, where a node’s kubelet trades heartbeats with the apiserver. That’s exactly what I’d set up so far: making the node and the apiserver reach each other over Tailscale 100.x. As long as this path is alive, a node looks Ready.
  • Pod ↔ Pod (across node boundaries) — the data-plane path, where Pods on different nodes exchange packets directly. This is a completely separate road from the control plane, and the thing that lays it isn’t the kubelet but the CNI (here, flannel).

Node Ready and Pod networking are different planes — control plane vs data plane

All six being Ready with a 100.x (Tailscale) INTERNAL-IP only means the control plane is sound. What finished last time went as far as “the nodes are recognized as members of one cluster” — the road that carries Pod traffic across node boundaries had not been verified yet.

Even with kubectl get nodes all Ready, inter-node Pod traffic isn’t guaranteed. The control plane (node ↔ apiserver) and the data plane (Pod ↔ Pod) are separate paths, and the latter is the CNI’s job. So the next question narrows to one thing — exactly how does flannel carry a Pod packet across node boundaries?

2. flannel and VXLAN — how a Pod packet crosses a node boundary

k3s uses flannel as its CNI and VXLAN as flannel’s default backend. In #1 I brought the servers up with --flannel-backend vxlan, and the #2 agents inherited that setting as-is. (k3s Basic Network Options — flannel’s default backend is vxlan; host-gw, wireguard-native, and none are the alternatives.)

Let me trace a Pod packet’s journey in two cases.

  • Within the same node — every Pod hangs off the node’s cni0 bridge. Two Pods on the same node meet directly at L2 on that bridge. They never cross a node boundary, so it’s fast and never congested. (That’s why “Pods on the same node talked” in §1.)
  • To another node — when the destination Pod is on a different node, the packet leaves cni0 and enters a virtual interface called flannel.1. This is VXLAN’s VTEP (VXLAN Tunnel Endpoint). Here the original Pod packet (Ethernet frame and all) is encapsulated whole inside a UDP packet and sent to the peer node.

The “address” and “port” that receive this capsule are the crux.

  • The port is UDP 8472. On Linux, flannel’s VXLAN backend uses the kernel default port 8472/udp (only on Windows does it use the IANA standard 4789). So nodes must be able to reach each other on 8472/udp. (flannel backends — “On Linux, defaults to kernel default, currently 8472” · k3s network requirements)
  • The address is each node’s advertised public-ip. flannel advertises, per node, a destination (the VTEP’s outer IP) that says “this is where I receive VXLAN capsules.” The default is the IP of that node’s default-route interface (the per-node real values are in §5). This “which address gets advertised” is what trips us up all the way through.

flannel VXLAN dataflow — from cni0 to flannel.1, encapsulated and sent to the peer on 8472

Encapsulation isn’t free. The VXLAN header (outer IP/UDP + VXLAN + inner Ethernet) eats an extra 50 bytes per packet. So flannel sets flannel.1’s MTU to the host interface’s MTU minus 50. If the host is 1500, flannel.1 becomes 1450.

On an agent node (lima-k3s-agent), it looks like this:

$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.42.0.0/16
FLANNEL_SUBNET=10.42.2.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true

$ ip -br link | grep -E 'eth0|lima0|tailscale0|flannel'
eth0         UP   ... mtu 1500
lima0        UP   ... mtu 1500
flannel.1    UNKNOWN ... mtu 1450     # 1500 - 50 (VXLAN)
tailscale0   UNKNOWN ... mtu 1280     # this 1280 bites in §5

That flannel.1 is a VXLAN device sending to port 8472 shows up in one line with -d (details):

$ ip -d link show flannel.1
5: flannel.1: <...> mtu 1450 qdisc noqueue state UNKNOWN ...
    vxlan id 1 local 192.168.105.2 dev lima0 srcport 0 0 dstport 8472 nolearning ttl auto ...

vxlan id 1 with dstport 8472 — “VXLAN → send to 8472” is right there in that line. The trailing local 192.168.105.2 dev lima0, i.e. “which address/interface this node sends VXLAN out of,” is the real key this time — but why it has that value is something §5 untangles.

MTU — the maximum size of a single packet

The output above showed mtu 1500, mtu 1450, and mtu 1280. Since this number bites hard in §5, let me pin it down here.

MTU (Maximum Transmission Unit) is the maximum size (in bytes) of a single packet an interface can carry at once. Ethernet’s standard default is 1500, so an ordinary NIC starts at 1500. A packet larger than the MTU is split into fragments, or — if it can’t be split — simply dropped. Fragmenting is slow, and dropping makes traffic look like it’s stalled.

The crux is that the more you wrap, the less real size fits inside. Wrap a box in a bigger box and the outer size (1500) stays the same, but what fits inside shrinks by the padding. That’s why each interface has a different MTU.

InterfaceMTUWhy this value
eth0 / lima01500bare Ethernet default, no encapsulation
flannel.114501500 − 50. The ceiling so that one VXLAN-header (50B) wrap still stays under 1500
tailscale01280WireGuard’s encryption overhead + a conservative value safe on any link (IPv6’s minimum MTU)

Each wrap shrinks the MTU — 1500 → 1450 (VXLAN) → 1230 (VXLAN over Tailscale)

If flannel VXLAN runs over physical Ethernet (1500), 1450 fits nicely. The problem is when inter-node traffic rides over Tailscale — a Pod packet gets wrapped once by VXLAN and again by WireGuard on top, doubly wrapped, and the size that fits inside shrinks further to about 1280 − 50 = 1230. flannel still sends as if it had 1450, and when the tunnel can’t accept that much, the gap erupts as fragmentation, drops, and retransmits. Why this “double encapsulation” wrecks even latency is covered in §5.

Inter-node Pod traffic ultimately reduces to “can the nodes exchange UDP 8472 to each other’s public-ip?” If that’s blocked, the Pod network breaks (§3); if the path is slow, the Pod network is slow (§5). It’s enough to remember that tailscale0’s MTU is 1280 — putting VXLAN on top of it shrinks the ceiling further.

3. Inter-node traffic failed, so I opened 8472

If §2’s conclusion holds, cross-node Pod traffic comes down to “can the nodes exchange 8472 (VXLAN) to each other’s public-ip?” So when traffic failed, there was one most-likely suspect — 8472 is blocked.

That’s because in #1 I’d closed the firewall by the book. Cluster ports like the apiserver (6443), kubelet (10250), and flannel VXLAN (8472) aren’t opened to the public net; they’re reachable only inside the private network / tailnet — the right default for minimizing exposure (#1’s firewall table). So “inter-node VXLAN is blocked on the public net, which is why inter-node Pod traffic fails” was a natural hypothesis.

Inside the node, flannel was indeed listening on 8472:

$ sudo ss -ulnp | grep 8472
UNCONN  0   0   0.0.0.0:8472   0.0.0.0:*

The port is open inside the node, yet it doesn’t reach Pods on another node — so what’s blocked isn’t inside the node but the road between nodes, I figured. On a “just make it work” impulse, I broke the principle and opened 8472/udp inbound on the Lightsail firewall. Inter-node Pod traffic went through, and I ran it that way.

Close 8472 and it blocks; open it and it flows — and it's exposed

Note — not a recommended setup. Unlike 6443, 8472 isn’t a port guarded by certificates and tokens; opening it to the public net itself widens the exposed surface. “It went through, so this is the answer” was what I thought at the time, but “opening the port made it work” doesn’t mean “the port is why it worked” — this judgment gets overturned in §8 when I look at the actual route.

When inter-node Pod traffic fails, suspecting inter-node 8472/udp (VXLAN) reachability first is a reasonable starting point. But here I opened it to the public net, and that was a debt. The bill arrives in the very next section.

4. I added Longhorn and the restarts wouldn’t stop

Now that Pods could talk, I wanted to run a stateful workload “like a real cluster.” I picked Longhorn — distributed block storage for Kubernetes.

The reason is simple. k3s’s default storage (local-path) is a hostPath tied to one node’s disk, so it has no replication — when a Pod moves to another node, the data doesn’t follow. To try “data survives even if a node dies,” you need distributed storage that replicates data across multiple nodes. Longhorn is the CNCF project that fills that role: it stands up a dedicated controller (the Longhorn Engine) per volume, treats each volume like a microservice, and keeps synchronous replicas of it on multiple nodes’ local disks. (Longhorn — What is Longhorn?)

Install was one Helm line, and at first it looked fine. Pods came up Running and volumes were created.

But almost immediately, errors and restarts began repeating endlessly. Volumes dropped to Degraded, replica syncs timed out, rebuilds spun up to fix them, that load made them fail again — a self-reinforcing vicious cycle. Even though I hadn’t rebuilt the cluster, nodes got shaken, and other Pods on top got dragged in.

The cause was inter-node latency. Longhorn’s synchronous replication waits, for every single write, until the replicas respond “written.” That’s why Longhorn’s docs state plainly that “latency is far more important to a volume’s stability than throughput or IOPS” (Longhorn Best Practices), and the troubleshooting guide goes further, recommending inter-node latency under 20ms when multiple volumes do I/O on one node at the same time (Longhorn KB — volume readonly / I/O error).

But latency between the home (lima) nodes was far above that. Measured, the replica’s synchronous-write latency averaged 200ms+ — over ten times the recommended 20ms. Since synchronous replication eats that whole latency on every write, volumes couldn’t climb out of Degraded, and rebuilds and restarts spun forever. In the end Longhorn effectively dropped the lima (home) nodes from storage, and the goal of “run stateful workloads on a real multi-node cluster” looked impossible on top of this latency.

Sync replication waits for the slowest node — Longhorn collapsed above 200ms

There was one more thing that didn’t add up. Those four too-slow home nodes are physically inside the same single iMac. They’re VMs in one machine — so where does 200ms come from? The next section follows that route.

5. Same host, so why slow? — to reach the next node, packets went to Tokyo and back

§4’s puzzle was this: four VMs inside one iMac, yet 200ms of inter-node latency. Let me follow where the packets actually go.

Clue 1 — there’s no direct path between VMs on the same host. Printing each VM’s address is odd:

$ limactl shell k3s-agent   -- ip -4 addr show eth0 | grep inet
    inet 192.168.5.15/24 ... eth0
$ limactl shell k3s-agent-2 -- ip -4 addr show eth0 | grep inet
    inet 192.168.5.15/24 ... eth0

Both VMs’ eth0 are 192.168.5.15. Lima’s default user-mode network fixes the subnet at 192.168.5.0/24, and each VM gets the same address behind its own independent NAT. So they’re unreachable directly from the host and from other guests (VMs) — they don’t even know each other, despite being inside the same iMac. Lima’s own docs note this limitation and point you to “use VMNet to access from the host or other guests” (Lima — user-mode network).

Clue 2 — with no direct path, even reaching the next VM takes the long-distance road. Since the VMs can’t reach each other directly, inter-node packets (including Pod VXLAN) take the one common path — the Tailscale (100.x) that bound both sites in #1 and #2. Reaching the VM right next door is no different. Tailscale is a tool for connecting long distances across NAT, so when a direct (P2P) path isn’t possible, it detours through a relay (DERP). Here’s what tailscale netcheck said:

$ tailscale netcheck
   * MappingVariesByDestIP: true       ← NAT where direct P2P is hard (endpoint-dependent mapping)
   * Nearest DERP: Tokyo (30.3ms)

With MappingVariesByDestIP: true, a direct connection isn’t possible, so Tailscale detours through the nearest DERP relay — Tokyo. A packet from Sapporo’s VM A to its neighbor VM B left the house, went all the way to Tokyo, and came back to Sapporo.

Same iMac VMs — yet packets went via Tokyo DERP and back

Clue 3 — measure it and that detour is right there. Latency between VMs on the same iMac:

lima ↔ lima-2 :  87 ms (max 202, jitter 54)
lima ↔ lima-3 : 133 ms (max 268, jitter 92)

Tens to a hundred ms for the VM right next door — the cost of a Tokyo round-trip. And at this point the home nodes’ Pod VXLAN, whose destination (public-ip) is a tailnet 100.x, has flannel VXLAN riding on top of Tailscale (WireGuard) as well — piling on §2’s double encapsulation and the MTU-1280 squeeze. Since Longhorn’s synchronous replication ate this round-trip on every write, 200ms was the obvious result.

Everything so far is about VMs inside the same iMac (lima ↔ lima). The cross-site link (home ↔ Tokyo) takes a different path (no double encapsulation there), covered separately in §8.

Nodes being on the same physical host doesn’t make them fast. When the virtualization network isolates the VMs, even talking to the VM right next door can detour far away through the overlay. Nearby traffic should end nearby — and §6 and §7 find that road.

6. How to fix it

The problem was clear. The four VMs on one iMac have no LAN that reaches each other directly, so even traffic to the VM next door goes through the long-distance Tailscale + DERP. I lined up the candidates for fixing it.

Fixing inter-node latency — 1-4 just rewrap on the slow path, 5 (vmnet) changes the path itself

  • Pin flannel to tailscale0 (--flannel-iface=tailscale0) — put VXLAN explicitly on the tailnet. But the root problem (a long-distance detour and double encap despite being on the same host) stays. And the servers use a VPC address; setting only the agents to tailscale0 makes the destinations disagree, and it breaks asymmetrically — traffic passes one way only (you’d have to change both, touching #1’s server config too). Latency doesn’t drop either.
  • flannel’s host-gw backend — route directly to node IPs with no encapsulation; fastest. But it assumes direct L2 connectivity between all nodes (k3s docs). There’s no L2 between Tokyo and Sapporo, and none between the user-mode-isolated VMs either.
  • flannel’s wireguard-native backend — encrypt with WireGuard instead of VXLAN. Good for security, but it still runs over the same detour path, so latency is unchanged.
  • Force Tailscale into P2P — connect directly instead of via DERP. But what blocks the direct path is Lima’s default user-mode network isolating the VMs, so it can’t be solved from the Tailscale side alone.

What 1–4 have in common is that they just change the wrapping on a slow path. The path itself (leaving the same machine and coming back) stays, so latency doesn’t drop. So I flipped the idea around.

  • Use the fact that they’re on the same host — give the VMs a real LAN. Since the four live in one iMac, if the host lays a virtual LAN and lets the VMs talk directly at L2, they never go through Tailscale or DERP at all. This is exactly the method Lima’s docs recommend for the user-mode isolation (VMNet), implemented on macOS via socket_vmnet.adopted.

The test for picking a candidate was one thing — “does it remove the root cause of the latency (no direct path)?” 1–4 try to “go faster” or “re-wrap with a different VPN” and leave the path alone, but vmnet changes the path itself into a LAN inside the house. Then Tailscale handles the long-distance Tokyo ↔ Sapporo leg, and same-host traffic ends at home. The next section applies it and checks the result.

7. Laying a LAN inside the house with socket_vmnet

7-1. vmnet and socket_vmnet

The problem from §5 was “the VMs on one iMac have no LAN that reaches each other directly.” What fills that gap is vmnet.

vmnet is macOS’s built-in virtual-networking framework (vmnet.framework) — Apple’s official API that builds NAT, bridges, and host networking for VMs. But using it directly requires the VM process to hold root privileges and an entitlement, which is a hassle. socket_vmnet is a small daemon built by the Lima project that wraps this vmnet.framework and exposes it over a Unix socket. Only the socket_vmnet daemon runs as root; the VMs just connect to that socket — the VMs themselves don’t need to run as root. (That’s why the install in 7-2 puts the binary in a root-owned /opt and lays down a sudoers entry.) (socket_vmnet)

socket_vmnet offers three modes:

  • shared — a private subnet (192.168.105.0/24) + internet NAT. The VMs connected to the same socket_vmnet sit on one virtual switch (L2) and talk to each other directly. ← what we need.
  • bridged — joins the VMs straight onto the host’s physical LAN (e.g. en0).
  • host — an isolated network with no internet.

(Lima — VMNet)

How the structure changes is the crux.

  • Before — each VM had only eth0 (Lima’s default user-mode, 192.168.5.15, mutually isolated) and tailscale0 (100.x). With no direct path between VMs, inter-node traffic (including Pod VXLAN) leaked onto tailscale0, causing §5’s long-distance detour.
  • After — each VM gets one more interface, lima0 (vmnet shared, 192.168.105.x). eth0 and tailscale0 stay as they were, and the k3s node’s InternalIP is still the Tailscale 100.x, so the cluster’s identity and membership don’t change. Only the data path changes — lima0 becomes the node’s default route, and per the rule from §2 (“flannel picks the default-route interface as its VXLAN destination / public-ip”), flannel moves its VXLAN destination to lima0. So Pod traffic between home nodes finishes over vmnet without going through the tailnet or DERP.

In short, it’s an additive change — without rebuilding the cluster or touching the nodes’ identity, you add one interface (lima0) and reroute only same-host traffic onto a fast LAN.

What vmnet changes — adding one lima0 interface

7-2. Install socket_vmnet (host = the iMac)

Because socket_vmnet is a daemon that runs as root, the binary must live on a root-owned path that a user can’t tamper with. Lima discourages installing it via Homebrew for security reasons, so build it from source into /opt/socket_vmnet. (Lima — VMNet)

git clone https://github.com/lima-vm/socket_vmnet
cd socket_vmnet
git checkout v1.2.2          # check the latest stable tag on the releases page
make
sudo make PREFIX=/opt/socket_vmnet install.bin
# → /opt/socket_vmnet/bin/socket_vmnet (root-owned)

7-3. Register the Lima sudoers entry

So Lima can launch socket_vmnet as root, lay down a sudoers fragment.

limactl sudoers > etc_sudoers.d_lima
less etc_sudoers.d_lima                 # review the contents first
sudo install -o root etc_sudoers.d_lima /etc/sudoers.d/lima
rm etc_sudoers.d_lima

7-4. Attach the shared network to each VM

Add the shared network to each VM’s ~/.lima/<vm>/lima.yaml. This network lays down 192.168.105.0/24 (gateway 192.168.105.1) and gives each VM an address in that range via a lima0 interface.

networks:
  - lima: shared
    interface: lima0

Note — duplicate networks: key. If a networks: key already exists in lima.yaml (even as a comment), appending another at the end causes a YAML duplicate-key parse error. Merge under the existing key.

7-5. Restart one at a time

Restart one VM at a time, confirming each goes Ready again (to minimize cluster impact).

for VM in k3s-agent k3s-agent-2 k3s-agent-3 k3s-agent-4; do
  limactl stop  "$VM"
  limactl start "$VM"
  kubectl --context k3s-lightsail wait --for=condition=Ready node/lima-"$VM" --timeout=180s
done

On restart, k3s comes back up and flannel re-picks its interface. Since the default route is now lima0 (192.168.105.1), flannel re-advertises its VXLAN destination (public-ip) as the vmnet address per the rule from §2 / §7-1 — with no extra flag like --flannel-iface.

7-6. Verify

Check that each VM got a vmnet address on lima0:

$ limactl shell k3s-agent -- ip -4 addr show lima0 | grep inet
    inet 192.168.105.2/24 ... lima0

Then re-measure ping between VMs on the same iMac:

$ limactl shell k3s-agent -- ping -c5 -q 192.168.105.3
5 packets transmitted, 5 received, 0% packet loss
rtt min/avg/max/mdev = 0.449/0.571/0.639/0.064 ms

The same-host VM-to-VM latency that was 87~133 ms dropped to the 0.5 ms range. The packets that had been round-tripping to Tokyo now finish inside the iMac.

After socket_vmnet — from an 87-133ms Tokyo detour to a 0.5ms direct lima0 path

By default, flannel picks “the default-route interface” as its VXLAN destination. So if you give a node a faster direct path (here, vmnet’s lima0) and make it the default route, flannel switches over to it on its own — with no extra CNI config.

8. Result — what vmnet changed, and the truth about 8472

After applying it, the VXLAN destination (public-ip) that flannel advertises splits cleanly into the two sites:

$ kubectl --context k3s-lightsail get nodes \
    -o custom-columns='NAME:.metadata.name,PUBLIC-IP:.metadata.annotations.flannel\.alpha\.coreos\.com/public-ip'
NAME               PUBLIC-IP
ip-172-26-2-70     172.26.2.70      # Tokyo (server) = VPC
ip-172-26-3-146    172.26.3.146
lima-k3s-agent     192.168.105.2    # home (lima) = vmnet
lima-k3s-agent-2   192.168.105.3
lima-k3s-agent-3   192.168.105.4
lima-k3s-agent-4   192.168.105.5

The four home nodes now exchange VXLAN directly over each other’s vmnet address (192.168.105.x) — 0.5 ms.

The 8472 I opened in §3 — I went to close it and looked at the actual route

With the home nodes sorted out by vmnet, it was time to close the 8472 I’d opened against principle in §3. But the moment I went to close it, I got nervous — what if closing it breaks inter-node Pod traffic again? Because in §3 I’d thought “opening it made it work.”

So before closing, instead of guessing I looked at the actual route. All you have to do is query the route from one node to a Pod on a node at the other site.

# from a home (lima) node to a Pod on a Tokyo (server) node
$ ip route get 10.42.0.235
10.42.0.235 dev tailscale0 table 52 src 100.84.x.x ...

dev tailscale0, not dev flannel.1. In other words, cross-site Pod traffic was flowing directly over Tailscale, not flannel VXLAN (8472).

Here’s the mechanism. In this cluster, each node advertises its own pod CIDR (10.42.N.0/24) as a Tailscale subnet route and accepts the others’ (accept-routes). So a remote node’s Pod range bypasses flannel and travels directly over the encrypted Tailscale (WireGuard). This is also the official way k3s binds a distributed / multi-cloud cluster with Tailscale (k3s — Distributed/multicloud, Tailscale — Subnet routers). So the “double encapsulation, wrapping VXLAN over WireGuard” from §2 / §5 was the old path of the same host (lima ↔ lima); cross-site is a single layer of WireGuard.

Having confirmed the route, I closed it — restricting it from public (0.0.0.0/0) to VPC-private (172.26.0.0/16). Then I re-measured after closing:

cross-site (home → Tokyo Pod)  : 20~26 ms   (Tailscale, unchanged)
home nodes (lima ↔ lima)       : 0.4 ms     (vmnet, unchanged)
servers (server ↔ server)      : 0.3 ms     (VPC VXLAN, unchanged)
6 nodes Ready                  : 6/6

Nothing broke. The public 8472 wasn’t the actual route of inter-node traffic — it was just a leftover. 8472 is still in use, but only inside private networks — lima ↔ lima over vmnet, server ↔ server over VPC.

Cross-site was flowing over Tailscale — the public 8472 was a leftover

Why bother closing it? VXLAN is a protocol with no authentication and no encryption (RFC 7348’s security considerations also state plainly that “VXLAN itself provides no authentication or encryption”). Its only identifier is the VNI, and flannel’s default is 1, so anyone who can reach 8472 on the public net can inject packets into the Pod overlay (10.42.0.0/16). 6443 at least has a gate of certificates and tokens; 8472 doesn’t even have that. So while I was at it, I also closed the apiserver (6443) and kubelet (10250) to the public net — VPC/tailnet only — and restricted SSH (22) to the tailnet.

“It worked after a change” doesn’t mean “the change is why it worked.” Opening 8472 and traffic going through was a fact, but what actually carried the traffic was Tailscale — the public 8472 was a leftover from the start. Before you open and close ports on a guess, read the actual route once with ip route get — that’s the cheapest way to cut costly exposure.

A remaining limit — cross-site is far, by the distance

What vmnet fixed is inside the same host (between home nodes). The Tokyo-cloud ↔ Sapporo-home leg is physically far apart, so it’s bound by Tailscale, and that latency (tens of ms) is a value set by distance that you can’t shrink.

So I compensate with placement — keep workloads with a lot of inter-node synchronous traffic (latency-sensitive ones) gathered within the home node group. I use the node-type=lima label I’d put on the home nodes back in #2, via a nodeSelector.

$ kubectl --context k3s-lightsail get nodes -L node-type
NAME               ...   NODE-TYPE
ip-172-26-2-70     ...   lightsail
ip-172-26-3-146    ...   lightsail
lima-k3s-agent     ...   lima
lima-k3s-agent-2   ...   lima
lima-k3s-agent-3   ...   lima
lima-k3s-agent-4   ...   lima

In a workload’s manifest, you write it like this (if the label isn’t there, set it first with kubectl label node lima-k3s-agent node-type=lima):

spec:
  template:
    spec:
      nodeSelector:
        node-type: lima       # this Pod group only on home nodes (vmnet, 0.5ms)

Final picture — vmnet for what's near, Tailscale for what's far, placement via nodeSelector

Even within one cluster, inter-node latency isn’t uniform. Accept the fast leg (same host = vmnet) and the slow leg (long distance = Tailscale), and place workloads by latency with a nodeSelector to steer around the slow leg.

9. Glossary — what came up this time

  • flannel / VXLAN — k3s’s default CNI is flannel, its default backend VXLAN. It carries cross-node Pod packets encapsulated in UDP (8472).
  • VTEP / flannel.1 — the endpoint that wraps and unwraps VXLAN capsules. Exists per node as the flannel.1 interface.
  • flannel’s public-ip — the destination each node advertises as “this is where I receive my VXLAN.” Defaults to the node’s default-route interface IP.
  • MTU — the maximum size of a single packet. Shrinks the more you encapsulate (Ethernet 1500 → VXLAN 1450 → 1230 over Tailscale).
  • double encapsulation — wrapping a VXLAN packet again in WireGuard. Overhead, MTU squeeze, and latency pile up. (In this cluster it only happened for lima ↔ lima before vmnet; not for cross-site.)
  • Tailscale / DERP — a WireGuard-based mesh VPN. When a direct (P2P) path isn’t possible, it detours through a DERP relay.
  • Tailscale subnet route — a node advertises a range (here, its own pod CIDR) with --advertise-routes, another node receives it with --accept-routes, and that range’s traffic travels over the tailnet (WireGuard). Cross-site Pod traffic flows over this road.
  • NAT (endpoint-dependent mapping) — a NAT where the port mapping varies by destination. Direct P2P is hard, so it falls back to DERP.
  • vmnet (vmnet.framework) — Apple’s framework that provides NAT, bridges, and host networking to VMs on macOS.
  • socket_vmnet — a root daemon that exposes vmnet.framework over a Unix socket. VMs connect to the socket without root, get a shared LAN (192.168.105.0/24), and talk directly at L2 between VMs on the same host.
  • Lima user-mode network — Lima’s default network (fixed at 192.168.5.0/24). Isolates VMs from each other and from the host.
  • node-type label / nodeSelector — a label on the nodes (here, lima/lightsail). Used in a nodeSelector to place workloads on a particular group of nodes.

10. Next

With inter-node traffic sorted out, I can finally run all sorts of services stably on these six nodes.

In the end, the important problem that Longhorn surfaced was solved well, and it’s still running stably today. That said, given the nodes’ spec limits, I decided to give up on Longhorn for now. Once I have a roomier environment, I’m leaving “put it back and test it” as homework.

What I’ve written up to now lets me run a fair range of services, but optimization and other small issues I’ve been handling as I operate. Once the material piles up a bit, I’d like to gather and organize that too.

Next time I plan to talk about CloudNativePG (CNPG). Right as the inter-node networking got solved, I set up CNPG to practice and verify running a service with internal clustering — and it’s now serving as the main DB for quite a few services.

Thanks for reading all the way through.


References