Hybrid k3s #1: Cloud and home into one cluster — initial setup


[Figure: Hybrid k3s, current architecture]

0. About this series

This series is a record — written one piece at a time — of how I actually built the homelab shown in the diagram above, the one I’m running right now.

What started as a toy project born from a simple “could this even work?” has, thanks to surprisingly satisfying performance and endless rounds of tearing down and rebuilding, become a real hobby that relieves the stress built up at work.

It isn’t a resource-rich cluster, but it has been more than enough to get a real taste of Kubernetes, and it keeps giving me new things I want to try next.

  • 6 nodes — 2 Lightsail servers (control plane + etcd) in the cloud (AWS Tokyo) + 4 Lima VM agents on a home (Sapporo) iMac
  • 19 vCPU / 61 GiB total, 49 namespaces, 248 pods (150 running)
  • Deployment via ArgoCD, authentication via Keycloak OIDC, with CloudNativePG, Vault, CrowdSec, Prometheus/Grafana, and more running on top

It wasn’t easy, but it wasn’t hard enough to give up on either — so I’m going to write up, one at a time, the things I learned while building it and the things I want to keep.

This first story is about the foundation — how I started from two control-plane nodes in the cloud.

1. Background

There was no grand blueprint to begin with. The starting point was ordinary.

Working with Kubernetes in my day job, I keep running into things I want to dig into further. Reading the docs is one thing; breaking and fixing a cluster with my own hands is another. There is an environment I can touch at work, but access is limited, and a careless mistake there causes noisy, annoying trouble.

I needed a cluster I could run however I wanted.

As it happened, a 64GB-RAM iMac, more than 10 years old, was sitting mostly idle at home. It still performs well enough, but it has an HDD so it’s slow, its OS is past end-of-support, and it has handed its seat to a MacBook Pro M4 and is now resting. On the cloud side, I already had two small Lightsail instances running personal services, and as those services grew, resources were gradually getting tight.

“What if I stopped keeping the idle home machine’s resources and the cloud I’m already paying for separate, and used them as one?”

The urge to learn and the pressure on resources converged on a single idea — combine the cloud and home into one cluster. This article is the first dig: building the cloud-side foundation.

2. Why k3s — a choice under limited resources

First, let’s prepare a Kubernetes (k8s) environment.

But for the resources I had in the cloud, standard k8s was too heavy. In my dreams I wanted to run wild on a multi-cluster setup with thousands of nodes; in reality I had a small AWS Lightsail footprint of about $150/month and a single 10-plus-year-old iMac near retirement.

I had to pick “which Kubernetes to go with” first. Here’s what my research turned up.

| Option | Character | For this situation |
|---|---|---|
| Managed (EKS/GKE/AKS) | The cloud runs the control plane for you | Control-plane fee + node cost → conflicts with low cost / reusing idle gear, excluded |
| Vanilla Kubernetes (kubeadm) | Assemble upstream yourself | The most orthodox but heavy and hands-on → a burden for low-spec/small scale, excluded |
| k3s (Rancher/SUSE) | Single-binary lightweight distro | Lightweight distro, finalist |
| k0s · MicroK8s | Lightweight distros of a similar kind | Likewise lightweight distros, finalist |
| minikube · kind | For local dev/testing | Not meant for persistent multi-node operation → excluded |

Filtering this way, the candidates narrowed to three lightweight distros: k3s, k0s, and MicroK8s. Digging deeper into the three:

| Item | k3s (chosen) | k0s | MicroK8s |
|---|---|---|---|
| Maker | Rancher/SUSE | Mirantis | Canonical |
| Packaging | Single binary | Single binary | snap package (depends on snapd) |
| Default datastore | SQLite (kine); embedded etcd for HA | etcd standard (kine for other DBs too) | dqlite (distributed SQLite, Raft) |
| HA approach | Switches to etcd with multiple servers | Provided by default | Automatic HA at 3+ nodes |
| Control plane | server also runs workloads | Internal components as separate processes, control-plane isolation | Per node |
| Default CNI | flannel (lightweight, limited policy) | kube-router/calico | calico (HA variant) |
| Bundling | Essential components included (Traefik, ServiceLB, local-path…) | Minimal, easy to swap default components | Enable add-ons with microk8s enable |

Why k3s.

All three are CNCF-compliant lightweight distros, but they differ in character.

k0s keeps the control plane separate from workloads, which is clean, but it ships with fewer things, so there’s more to plug in yourself.

MicroK8s has the convenience of enabling add-ons with a single microk8s enable line, but in return it’s tied to snap, and there are reported cases of dqlite CPU/consensus instability on write-heavy clusters. (GitHub Issue #3227)

k3s, on the other hand, has essential components bundled into a single binary, so the initial setup is the fastest, and the path of moving to embedded etcd with multiple servers fits naturally with this kind of “cloud + home HA.” Add low-spec/ARM support and the depth of its docs and community, and for the goal of learning and low-cost operation at once, k3s fit best. (comparison sources: Palark · Portainer · nOps)

k3s repackages Kubernetes as a single binary (under 100MB) while staying 100% compatible (CNCF certified). Its requirements are essentially just a modern kernel + cgroups, so it’s no strain even on low-spec hardware. (What is K3s)

Three reasons it’s light:

  1. Single binary, single process. Components that run separately in regular Kubernetes — kube-apiserver, kube-scheduler, kube-controller-manager, kubelet, kube-proxy — are wrapped into one k3s process, with the containerd runtime built in. (Architecture)
  2. Flexible datastore. A single server uses SQLite by default; a multi-server cluster uses embedded etcd, which is enabled by starting the first server with --cluster-init (external MySQL/Postgres are also possible). (Datastore)
  3. Essential components included. flannel (CNI), CoreDNS, Traefik (Ingress), ServiceLB, local-path (storage), and metrics-server are brought up together at install time. That’s that much less to assemble yourself.

As a bonus, k3s nodes come in two kinds — server (control plane + datastore) and agent (workload only) — which made it a good match for a hybrid setup like “cloud = server, home = agent.” You’ll see this in the diagrams from chapter 4 onward.

3. The control plane — three is the rule, but a two-node challenge

Originally I ran personal services in the cloud with Docker Compose. The small instance handled the DB, and the large instance handled several microservices. Moving these two to Kubernetes, my first worry was the control plane.

For Kubernetes to be stable, control-plane HA is the baseline. k3s’s embedded etcd can’t accept writes unless it keeps a majority (quorum), and the official HA guide recommends 3 or more servers (an odd number). With n nodes the quorum is ⌊n/2⌋ + 1 (n/2 rounded down, plus one), and the node count minus the quorum is how many node failures you can tolerate.

| servers | quorum | failures tolerated |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 2 | 1 |
| 4 | 3 | 1 |
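
The quorum arithmetic can be checked with a few lines of shell. This is only a sketch of the formula from the paragraph above; the function names `quorum` and `tolerated_failures` are made up for illustration.

```shell
#!/bin/sh
# Quorum arithmetic for an etcd cluster with n voting members.
# Shell integer division truncates, so n / 2 is already floor(n/2).

quorum() {
  echo $(( $1 / 2 + 1 ))
}

tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print one row per cluster size.
for n in 1 2 3 4 5; do
  printf 'servers=%d quorum=%d failures_tolerated=%d\n' \
    "$n" "$(quorum "$n")" "$(tolerated_failures "$n")"
done
```

Going from 3 to 4 servers raises the quorum without raising the number of tolerated failures, which is why odd server counts are recommended.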

[Figure: etcd quorum, 2 vs 3]

The rule is three. But adding one more instance was tight on the wallet, so I changed the goal:

I know three is the right answer, but for now let me run two as stably as possible.

In choosing two, I made two things clear.

First, don’t pile everything on one node.

I once put the control plane and services all on a single node and got badly burned. Lightsail uses a burstable CPU model: each plan has a per-vCPU baseline percentage, and while load stays above that baseline the instance spends the burst capacity it has accrued; once that capacity reaches 0, the CPU drops back to the baseline. With the control plane (apiserver, etcd) on the same node, the moment the CPU dries up, cluster control itself stops, so I split the load across two nodes.

| node | plan | vCPU | baseline | role |
|---|---|---|---|---|
| server-A | 8GB ($44/mo) | 2 | 30% | cluster-init · control-plane+etcd+worker |
| server-B | 16GB ($84/mo) | 4 | 40% | join · control-plane+etcd+worker |

Checking usage at the time of writing, both are below baseline (the sustainable zone), accruing burst (kubectl top nodes):

NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
cp-8gb-init    482m         24%    4565Mi          58%
cp-16gb-join   1153m        28%    10096Mi         65%
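
A rough way to automate this comparison is to read the CPU% column and set it against each plan's baseline. The sketch below assumes the node names and baselines shown in this article; the helper names `baseline_for` and `check_baseline` are hypothetical, and treating the node-wide CPU% as the per-vCPU average is a simplification.

```shell
#!/bin/sh
# Compare each node's CPU% (as printed by `kubectl top nodes`) with the
# burst baseline of its Lightsail plan. Reads the table on stdin.
# The name-to-baseline mapping is an assumption for this cluster.

baseline_for() {
  case "$1" in
    cp-8gb-init)  echo 30 ;;
    cp-16gb-join) echo 40 ;;
    *)            echo "" ;;   # unknown node: no baseline known
  esac
}

check_baseline() {
  # Skip the header line, then strip the trailing % from column 3.
  tail -n +2 | while read -r name _cores cpu_pct _rest; do
    pct=${cpu_pct%\%}
    base=$(baseline_for "$name")
    case "$pct" in
      ''|*[!0-9]*) echo "$name no-metrics"; continue ;;
    esac
    if [ -z "$base" ]; then
      echo "$name unknown-baseline ${pct}%"
    elif [ "$pct" -lt "$base" ]; then
      echo "$name below-baseline ${pct}% < ${base}%"
    else
      echo "$name at-or-above-baseline ${pct}% >= ${base}%"
    fi
  done
}

# Sample input copied from the output above.
sample='NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
cp-8gb-init    482m         24%    4565Mi          58%
cp-16gb-join   1153m        28%    10096Mi         65%'

printf '%s\n' "$sample" | check_baseline
```

On a live cluster the same function could be fed with `kubectl top nodes | check_baseline`.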

[Figure: Lightsail burst CPU]

Second, admit that two is not HA, and take out insurance.

As the table shows, with two nodes, losing even one loses quorum and writes stop (pods already running keep going under kubelet, so it’s “no changes” rather than “total outage”). I cover that risk with etcd automatic snapshots. Since I gave no extra config, it runs with k3s defaults — 0 */12 * * * (twice a day), keep 5, stored at /var/lib/rancher/k3s/server/db/snapshots. (etcd-snapshot) Since they only pile up locally, pushing them to NAS/object storage later is a task I’ve left for the backup installment.
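
If you would rather state those defaults explicitly than rely on them, they can be written down as a k3s config fragment. The sketch below only prints the fragment; the helper name `render_snapshot_config` is made up, and the keys are assumed to mirror the server flags of the same name without the leading `--`.

```shell
#!/bin/sh
# Print a k3s config fragment that spells out the snapshot defaults
# quoted above. Nothing is written to disk here; redirect the output
# to wherever you keep your k3s configuration.

render_snapshot_config() {
  cat <<'EOF'
# etcd snapshot settings (k3s defaults, stated explicitly)
etcd-snapshot-schedule-cron: "0 */12 * * *"
etcd-snapshot-retention: 5
etcd-snapshot-dir: /var/lib/rancher/k3s/server/db/snapshots
EOF
}

render_snapshot_config
```

One possible use is to review the output, merge it into /etc/rancher/k3s/config.yaml on each server, and restart k3s. For a one-off snapshot before a risky change, `k3s etcd-snapshot save` takes one on demand.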

4. Today’s star — Tailscale

The control plane is on Lightsail in Tokyo; the machine I’ll use as a worker is the home iMac in Sapporo.

These two don’t share a private network.

The home machine sits behind a router on a private IP (192.168.x), so it can’t be reached directly from outside, and opening ports to expose it would mean exposing cluster ports like kubelet (10250) and VXLAN (8472) to the internet — dangerous. For k3s to bind nodes into one cluster, everyone has to be able to call each other by one stable address, and the current setup doesn’t have that.

So I went looking for a method among VPNs and meshes.

| Option | Character | For this situation |
|---|---|---|
| Direct port exposure + public IP | Expose as-is without a VPN | Effectively exposes kubelet/VXLAN to the internet → dangerous, dropped |
| raw WireGuard | Fast kernel VPN, manual keys/peers | Fast, but NAT traversal, key management, and access control are all manual |
| OpenVPN | Traditional hub-style VPN | Hub-centric rather than mesh, heavy to set up |
| ZeroTier | Managed mesh VPN | A solid candidate, similar in flavor |
| Tailscale | WireGuard + coordination (mesh) | Automatic NAT traversal, ACLs, MagicDNS, unattended keys, free for personal use ← chosen |
| Headscale | Self-hosted Tailscale control server | More freedom but the burden of self-operation → consider later |

After a good deal of trial and deliberation, I chose Tailscale. It’s a WireGuard-based mesh VPN: install a daemon on each machine and log in, and it joins a private network (a tailnet) tied to your account, with each machine getting one address in the 100.x range (Tailscale allocates these from 100.64.0.0/10). That address is reachable by the same value from anywhere, whether the machine is in Tokyo or behind a router in Sapporo, and Tailscale handles NAT traversal for you.

It means you can lay down a “virtual LAN” that puts the cloud and home on one plane. (Up to 100 machines can be registered for free.)

When k3s registers a node, it stamps the address given via --node-ip as that node’s identity (InternalIP). So by setting this value to a Tailscale address from the start, a home node joining later lands on the same 100.x plane as-is. That’s why I install Tailscale before k3s.

5. Tailscale: sign up · install · verify

The order is sign up → install → verify.

① Sign up. Log in at login.tailscale.com with an SSO account like Google, GitHub, or Microsoft, and a tailnet for that account is created automatically. There’s no separate signup form; SSO is the signup.

[Figure: Tailscale sign-in screen]

② (For servers) Prepare an auth key. Cloud servers have no browser, so issue an auth key (tskey-…) in advance from the admin console under Settings → Keys. You can skip this if you’ll connect interactively.

[Figure: Tailscale admin console, Keys]

③ Install & connect. On each of the two cloud nodes (Amazon Linux 2023):

curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up                 # authenticate via the printed URL (headless: --authkey tskey-… )
tailscale ip -4                   # this node's 100.x address — used directly as --node-ip in ch.6

④ Verify. If both nodes appear in the admin console Machines page (login.tailscale.com/admin/machines) with their 100.x address and hostname, it worked.

[Figure: Tailscale admin console, Machines list]

You can also check from the node:

tailscale status                  # list of machines in the tailnet + each one's 100.x

With this, the two cloud nodes see each other by 100.x in one tailnet. Now I bring up k3s with these addresses. (Tailscale Linux install)
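
Because the address goes straight into --node-ip, a quick sanity check before installing k3s can catch a pasted private or public IP. The sketch below is a rough check that only looks at the first two octets against 100.64.0.0/10, the range Tailscale assigns from; the function name `is_tailscale_ip` is hypothetical.

```shell
#!/bin/sh
# Succeed if an IPv4 address looks like a tailnet address, i.e. it
# falls in 100.64.0.0/10: first octet 100, second octet 64 to 127.
# Only the first two octets are inspected; this is not a full validator.

is_tailscale_ip() {
  case "$1" in
    100.*.*.*) ;;
    *) return 1 ;;
  esac
  second=${1#100.}        # drop the leading "100."
  second=${second%%.*}    # keep only the second octet
  case "$second" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$second" -ge 64 ] && [ "$second" -le 127 ]
}

# Example use on a node (not executed here):
#   ts_ip=$(tailscale ip -4)
#   is_tailscale_ip "$ts_ip" && echo "ok to pass as --node-ip: $ts_ip"
```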

6. Installing k3s (with Tailscale addresses)

Put the 100.x you got in chapter 5 straight into --node-ip.

[Figure: Bootstrap & join flow]

server-A (8GB):

curl -sfL https://get.k3s.io | K3S_TOKEN=<shared-secret> INSTALL_K3S_VERSION=v1.34.3+k3s1 \
  sh -s - server \
    --cluster-init \
    --node-ip 100.71.x.x \
    --node-external-ip <publicA> \
    --advertise-address 100.71.x.x \
    --flannel-backend vxlan
  • --cluster-init — initializes embedded etcd as the first server. (server flags)
  • --node-ip 100.71.x.x — advertises the Tailscale address received in ch.5 as the InternalIP.
  • --node-external-ip / --advertise-address — public IP (for external exposure), apiserver advertise address (Tailscale).
  • --flannel-backend vxlan — CNI backend (the default, stated explicitly).

K3S_TOKEN can be a value you set yourself, like choosing a password, or left unset so that k3s generates one automatically. Either way, you need this value to join other nodes, so save it somewhere safe or read it back on server-A from the path below.

  • /var/lib/rancher/k3s/server/node-token
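
If you choose the token yourself, one option is to generate a random value rather than invent one. This is a minimal sketch using /dev/urandom; the function name `generate_k3s_token` and the 32-byte length are arbitrary choices, not k3s requirements.

```shell
#!/bin/sh
# Generate a random hex string to use as K3S_TOKEN.
# Reads 32 bytes from /dev/urandom and hex-encodes them with od,
# then strips the spaces and newlines od puts between bytes.

generate_k3s_token() {
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
}

K3S_TOKEN=$(generate_k3s_token)

# Print only the length, never the token itself.
echo "token length: ${#K3S_TOKEN}"
```

Keep the generated value in a password manager or similar, since the same string has to be supplied on every server that joins.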

server-B (16GB) — joins as the second server. This node, too, joins the tailnet first, then just connects with the same token:

curl -sfL https://get.k3s.io | K3S_TOKEN=<shared-secret> INSTALL_K3S_VERSION=v1.34.3+k3s1 \
  sh -s - server \
    --server https://172.26.x.x:6443 \
    --node-ip 100.99.x.x
  • --server https://172.26.x.x:6443 = server-A’s address (a private IP, since it’s the same VPC).
  • --node-ip 100.99.x.x = this node’s Tailscale address.

The two Lightsail boxes are in the same AWS VPC, so joining itself used the private IP, but the InternalIP advertised to the cluster is Tailscale (100.x) for both.

Firewall — open only the minimum externally. (requirements)

| port | use | exposure |
|---|---|---|
| 80 / 443 | Traefik Ingress | all |
| 22 | SSH | my IP only |
| 6443 / 2379-2380 / 8472 / 10250 | apiserver · etcd · flannel · kubelet | closed publicly, private/Tailscale internal only |

7. Cluster setup — complete with two nodes

Attaching the home iMac as an agent is covered in the next article.

For now I’ve built the cluster with two Lightsail boxes, Tailscale applied. Listing the nodes, you can confirm both are Ready on the same version and runtime.

kubectl get nodes -o wide

NAME           STATUS  ROLES               AGE   VERSION       INTERNAL-IP   EXTERNAL-IP    OS-IMAGE                       KERNEL-VERSION              CONTAINER-RUNTIME
…3-146(8GB)    Ready   control-plane,etcd  139d  v1.34.3+k3s1  100.71.x.x    52.x.x.x       Amazon Linux 2023.7.20250512   6.1.134-…amzn2023.x86_64    containerd://2.1.5-k3s1
…2-70(16GB)    Ready   control-plane,etcd  139d  v1.34.3+k3s1  100.99.x.x    3.x.x.x        Amazon Linux 2023.9.20251105   6.1.156-…amzn2023.x86_64    containerd://2.1.5-k3s1
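
The same check can be scripted so it fails loudly when a node is not Ready or versions drift apart. The sketch below reads `kubectl get nodes --no-headers` output on stdin; the function name `nodes_consistent` and the sample node names are placeholders, and a status such as Ready,SchedulingDisabled is deliberately counted as not Ready.

```shell
#!/bin/sh
# Succeed only if every node is Ready and all nodes report one version.
# Input columns (kubectl get nodes --no-headers):
#   $1 NAME  $2 STATUS  $3 ROLES  $4 AGE  $5 VERSION

nodes_consistent() {
  awk '
    $2 != "Ready"      { not_ready++ }
    !($5 in versions)  { versions[$5] = 1; distinct++ }
    END {
      printf "nodes=%d not_ready=%d versions=%d\n", NR, not_ready, distinct
      exit (NR == 0 || not_ready > 0 || distinct != 1)
    }
  '
}

# Sample input with placeholder node names.
sample='node-a   Ready   control-plane,etcd   139d   v1.34.3+k3s1
node-b   Ready   control-plane,etcd   139d   v1.34.3+k3s1'

printf '%s\n' "$sample" | nodes_consistent
```

Against a real cluster this would be `kubectl get nodes --no-headers | nodes_consistent`, with the exit status usable in a script or cron job.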

Check whether the two nodes are etcd voting members (look at Conditions in kubectl describe node <name>):

Conditions:
  Type             Status   Reason                       Message
  ----             ------   ------                       -------
  EtcdIsVoter      True     MemberNotLearner             Node is a voting member of the etcd cluster
  MemoryPressure   False    KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False    KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False    KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True     KubeletReady                 kubelet is posting ready status

Check that the k3s default bundle came up too (kubectl get pods -n kube-system):

# kubectl get pods -n kube-system  →  k3s default bundle only (excerpt)
coredns-7f496c8d7d-nx9jc                  1/1  Running    139d   # DNS
local-path-provisioner-578895bd58-mgxpm   1/1  Running    139d   # local storage (default SC)
metrics-server-7b9c9c4b9c-76ldg           1/1  Running    139d   # metrics (kubectl top)
traefik-78df465dcc-66kn8                  1/1  Running    9d     # Ingress (server-A)
traefik-78df465dcc-gs4q7                  1/1  Running    8d     # Ingress (server-B) → one per node = 2 replicas
helm-install-traefik-crd-pmk4t            0/1  Completed  139d   # Helm Job that installed the bundle (completed)

That concludes setting up two cloud instances as a k3s cluster. Beyond installing k3s, I also put Tailscale in place so that any machine capable of running k3s can later join as an agent, wherever it is and whatever form it takes.

8. Next

The AWS Lightsail nodes are now formed into a cluster, and the groundwork for nodes to join is all set.

In the end it came down to one command per node, but this stage took more time than I expected.

Next, I’ll bring the iMac resting at home into this two-node cluster in earnest. I’ll set up Lima VMs on the iMac, run an agent in each, join them to the same tailnet, and write up the problems I ran into after joining along with how I solved them.


References