Hybrid k3s #1: Cloud and home into one cluster

Hybrid k3s — current architecture

0. About this series

This series is a record — written one piece at a time — of how I actually built the homelab shown in the diagram above, the one I’m running right now.

What started as a toy project, prompted by a simple “could this even work?”, has become a favorite toy that relieves the stress built up at work, thanks to satisfying performance and endless tearing down and rebuilding.

It isn’t a resource-rich cluster, but it has been more than enough to get a real taste of Kubernetes, and it keeps giving me new things I want to try next.

6 nodes — 2 Lightsail servers (control plane + etcd) in the cloud (AWS Tokyo) + 4 Lima VM agents on a home (Sapporo) iMac
19 vCPU / 61 GiB total, 49 namespaces, 248 pods (150 running)
Deployment via ArgoCD, authentication via Keycloak OIDC, with CloudNativePG, Vault, CrowdSec, Prometheus/Grafana, and more running on top

It wasn’t easy, but it wasn’t hard enough to give up on either — so I’m going to write up, one at a time, the things I learned while building it and the things I want to keep.

This first story is about the foundation — how I started from two control-plane nodes in the cloud.

1. Background

There was no grand blueprint to begin with. The starting point was ordinary.

Working with Kubernetes in my day job, I keep running into things I want to dig into further. Reading the docs is one thing; breaking and fixing a cluster with my own hands is another. There’s an environment I can touch at work too, but access is limited, and a careless mistake there leads to noisy, annoying situations, so there’s only so much I can try.

I needed a cluster I could run however I wanted.

As it happened, a 64GB-RAM iMac, more than 10 years old, was sitting mostly idle at home. It still performs well enough, but it has an HDD so it’s slow, its OS is past end-of-support, and it has handed its seat to a MacBook Pro M4 and is now resting. On the cloud side, I already had two small Lightsail instances running personal services, and as those services grew, resources were gradually getting tight.

“What if I stopped keeping the idle home machine’s resources and the cloud I’m already paying for separate, and used them as one?”

The urge to learn and the pressure on resources converged on a single idea: combine the cloud and home into one cluster. This article is the first step, building the cloud-side foundation.

2. Why k3s — a choice under limited resources

First, let’s prepare a Kubernetes (k8s) environment.

But for the resources I had in my cloud environment, standard k8s was too heavy. In my dreams I wanted to run wild on a multi-cluster with thousands of nodes; in reality I had small AWS Lightsail instances costing about $150/month and a single 10-plus-year-old iMac near retirement.

I had to pick “which Kubernetes to go with” first. Here’s what my research turned up.

Option	Character	For this situation
Managed (EKS/GKE/AKS)	The cloud runs the control plane for you	Control-plane fee + node cost → conflicts with low cost / reusing idle gear, excluded
Vanilla Kubernetes (kubeadm)	Assemble upstream yourself	The most orthodox but heavy and hands-on → a burden for low-spec/small scale, excluded
k3s (Rancher/SUSE)	Single-binary lightweight distro	Lightweight distro — finalist
k0s · MicroK8s	Lightweight distros of a similar kind	Likewise lightweight distros — finalist
minikube · kind	For local dev/testing	Not meant for persistent multi-node operation → excluded

Filtering this way, the candidates narrowed to three lightweight distros: k3s, k0s, and MicroK8s. Digging deeper into the three:

Item	k3s (chosen)	k0s	MicroK8s
Maker	Rancher/SUSE	Mirantis	Canonical
Packaging	Single binary	Single binary	snap package (depends on snapd)
Default datastore	SQLite (kine); embedded etcd for HA	etcd standard (kine for other DBs too)	dqlite (distributed SQLite, Raft)
HA approach	Switches to etcd with multiple servers	Provided by default	Automatic HA at 3+ nodes
Control plane	server also runs workloads	Internal components as separate processes, control-plane isolation	Per node
Default CNI	flannel (lightweight, limited policy)	kube-router/calico	calico (HA variant)
Bundling	Essential components included (Traefik, ServiceLB, local-path…)	Minimal, easy to swap default components	Enable add-ons with `microk8s enable`

Why k3s.

All three are CNCF-certified lightweight distros, but they differ in character.

k0s keeps the control plane separate from workloads, which is clean, but it ships with fewer things, so there’s more to plug in yourself.

MicroK8s has the convenience of enabling add-ons with a single microk8s enable line, but in return it’s tied to snap, and there are reported cases of dqlite CPU/consensus instability on write-heavy clusters. (GitHub Issue #3227)

k3s, on the other hand, has essential components bundled into a single binary, so the initial setup is the fastest, and the path of moving to embedded etcd with multiple servers fits naturally with this kind of “cloud + home HA.” Add low-spec/ARM support and the depth of its docs and community, and for the goal of learning and low-cost operation at once, k3s fit best. (comparison sources: Palark · Portainer · nOps)

k3s repackages upstream Kubernetes as a single binary (under 100MB) while staying fully compatible (it is CNCF certified). Its requirements are essentially just a modern kernel and cgroups, so it’s no strain even on low-spec hardware. (What is K3s)

There are three reasons it’s light:

Single binary, single process. Components that run separately in regular Kubernetes — kube-apiserver, kube-scheduler, kube-controller-manager, kubelet, kube-proxy — are wrapped into one k3s process, with the containerd runtime built in. (Architecture)
Flexible datastore. A single server uses SQLite by default; for multiple servers, embedded etcd is enabled by starting the first server with --cluster-init (external MySQL/Postgres are also possible). (Datastore)
Essential components included. flannel (CNI), CoreDNS, Traefik (Ingress), ServiceLB, local-path (storage), and metrics-server are brought up together at install time. That’s that much less to assemble yourself.

As a bonus, k3s nodes come in two kinds — server (control plane + datastore) and agent (workload only) — which made it a good match for a hybrid setup like “cloud = server, home = agent.” You’ll see this in the diagrams from chapter 4 onward.

3. The control plane — three is the rule, but a two-node challenge

Originally I ran personal services in the cloud with Docker Compose. The small instance handled the DB, and the large instance handled several microservices. Moving these two to Kubernetes, my first worry was the control plane.

For Kubernetes to be stable, control-plane HA is the baseline. k3s’s embedded etcd can’t accept writes unless it keeps a majority (quorum), and the official HA guide recommends 3 or more servers (an odd number). With n servers the quorum is floor(n/2)+1 (n/2 rounded down, plus 1), and the server count minus the quorum is how many server failures you can tolerate.

servers	quorum	failures tolerated
1	1	0
2	2	0
3	2	1
4	3	1

etcd quorum — 2 vs 3
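
The quorum table can be reproduced with plain integer arithmetic. This is a minimal sketch; the function names quorum and tolerated are my own for illustration and are not part of etcd or k3s.

```shell
#!/bin/sh
# Quorum for an n-member etcd cluster: floor(n/2) + 1.
# Shell arithmetic truncates, so $1 / 2 is already the floor for positive n.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster still accepts writes.
tolerated() {
  echo $(( $1 - $(quorum "$1") ))
}

printf 'servers\tquorum\tfailures tolerated\n'
for n in 1 2 3 4 5; do
  printf '%s\t%s\t%s\n' "$n" "$(quorum "$n")" "$(tolerated "$n")"
done
```

Going from 2 to 3 servers is the first step that buys any failure tolerance, and a 4th server adds nothing over 3, which is why odd numbers are recommended.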

The rule is three. But adding one more instance was tight on the wallet, so I changed the goal:

I know three is the right answer, but for now let me run two as stably as possible.

In choosing two, I made two things clear.

First, don’t pile everything on one node.

I once put the control plane and services all on a single node and got badly burned. Lightsail uses a burstable CPU model: each plan has a per-vCPU baseline percentage, and when load stays above that baseline the instance spends the burst capacity it has accrued; once that capacity reaches 0, the instance is throttled down to the baseline. With the control plane (apiserver, etcd) on the same node, cluster control itself stops the moment the CPU dries up, so I split the load across two nodes.

node	plan	vCPU	baseline	role
server-A	8GB ($44/mo)	2	30%	cluster-init · control-plane+etcd+worker
server-B	16GB ($84/mo)	4	40%	join · control-plane+etcd+worker

Checking usage at the time of writing, both are below baseline (the sustainable zone), accruing burst (kubectl top nodes):

NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
cp-8gb-init    482m         24%    4565Mi          58%
cp-16gb-join   1153m        28%    10096Mi         65%

Lightsail burst CPU
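
The comparison against the baseline is simple enough to script. The sketch below assumes the numbers from the two tables above (millicores from kubectl top nodes, vCPU count and baseline from the plan table); cpu_percent and burst_state are hypothetical helper names, and the percentage is only an approximation of the CPU% column.

```shell
#!/bin/sh
# Usage as a whole-number percentage of the node's total CPU.
# $1 = usage in millicores, $2 = vCPU count
cpu_percent() {
  echo $(( $1 * 100 / ($2 * 1000) ))
}

# Prints "below" when usage is under the plan baseline (burst accrues),
# otherwise "at-or-above" (burst is no longer accruing).
# $1 = millicores, $2 = vCPU count, $3 = baseline percent
burst_state() {
  if [ "$(cpu_percent "$1" "$2")" -lt "$3" ]; then
    echo below
  else
    echo at-or-above
  fi
}

burst_state 482 2 30    # server-A: 24% against a 30% baseline
burst_state 1153 4 40   # server-B: 28% against a 40% baseline
```

Both calls print below for the figures above, which matches the “sustainable zone” reading.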

Second, admit that two is not HA, and take out insurance.

As the table shows, with two nodes, losing even one loses quorum and writes stop (pods already running keep going under kubelet, so it’s “no changes” rather than “total outage”). I cover that risk with etcd automatic snapshots. Since I gave no extra config, it runs with k3s defaults — 0 */12 * * * (twice a day), keep 5, stored at /var/lib/rancher/k3s/server/db/snapshots. (etcd-snapshot) Since they only pile up locally, pushing them to NAS/object storage later is a task I’ve left for the backup installment.
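
If you would rather state the snapshot settings explicitly than rely on defaults, they can go in the k3s config file. This is a sketch, not what I run: it assumes the config-file keys mirror the server flags etcd-snapshot-schedule-cron, etcd-snapshot-retention and etcd-snapshot-dir, and it writes to a hypothetical scratch file (OUT) for review instead of touching /etc/rancher/k3s/config.yaml directly.

```shell
#!/bin/sh
# Write the snapshot settings to a local file for review.
# OUT is a hypothetical scratch path; on a server the content would be
# merged into /etc/rancher/k3s/config.yaml, followed by a k3s restart.
OUT="${OUT:-./etcd-snapshot-config.yaml}"

cat > "$OUT" <<'EOF'
# Same values as the k3s defaults described above, stated explicitly
etcd-snapshot-schedule-cron: "0 */12 * * *"
etcd-snapshot-retention: 5
etcd-snapshot-dir: /var/lib/rancher/k3s/server/db/snapshots
EOF

cat "$OUT"
```

For a one-off snapshot before a risky change, k3s etcd-snapshot save on a server node takes one on demand.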

4. Today’s star — Tailscale

The control plane is on Lightsail in Tokyo; the machine I’ll use as a worker is the home iMac in Sapporo.

These two don’t share a private network.

The home machine sits behind a router on a private IP (192.168.x), so it can’t be reached directly from outside, and opening ports to expose it would mean exposing cluster ports like kubelet (10250) and VXLAN (8472) to the internet — dangerous. For k3s to bind nodes into one cluster, everyone has to be able to call each other by one stable address, and the current setup doesn’t have that.

So I went looking for a method among VPNs and meshes.

Option	Character	For this situation
Direct port exposure + public IP	Expose as-is without a VPN	Effectively exposes kubelet/VXLAN to the internet → dangerous, dropped
raw WireGuard	Fast kernel VPN, manual keys/peers	Fast, but NAT traversal, key management, and access control are all manual
OpenVPN	Traditional hub-style VPN	Hub-centric rather than mesh, heavy to set up
ZeroTier	Managed mesh VPN	A solid candidate, similar in flavor
Tailscale	WireGuard + coordination (mesh)	Automatic NAT traversal, ACLs, MagicDNS, unattended keys, free for personal use ← chosen
Headscale	Self-hosted Tailscale control server	More freedom but the burden of self-operation → consider later

After a lot of trial and deliberation, I chose Tailscale. It’s a WireGuard-based mesh VPN: install a daemon on each machine and log in, and the machine joins a private network (a tailnet) tied to your account and gets one address in the 100.x range. That address stays the same and is reachable from anywhere in the tailnet, whether the machine is in Tokyo or behind a router in Sapporo, and Tailscale handles NAT traversal for you.

In effect, you can lay down a “virtual LAN” that puts the cloud and home on one plane. (And up to 100 machines can be registered for free.)

When k3s registers a node, it stamps the address given via --node-ip as that node’s identity (InternalIP). So by setting this value to a Tailscale address from the start, a home node joining later lands on the same 100.x plane as-is. That’s why I install Tailscale before k3s.
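
Because a wrong --node-ip is awkward to undo after registration, it can be worth checking the value first. Tailscale hands out IPv4 addresses from 100.64.0.0/10, so a small guard like the one below can catch a private or public IP pasted by mistake. The function name is_tailscale_ip is hypothetical, and this is only a sketch of the check.

```shell
#!/bin/sh
# Succeeds if $1 is an IPv4 address inside 100.64.0.0/10,
# i.e. first octet 100 and second octet between 64 and 127.
is_tailscale_ip() {
  case "$1" in
    ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;  # empty, non-numeric, or malformed dots
  esac
  old_ifs=$IFS
  IFS=.
  set -- $1            # split the address into its octets
  IFS=$old_ifs
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
  [ "$1" -eq 100 ] && [ "$2" -ge 64 ] && [ "$2" -le 127 ]
}

# Intended use on a node that has already joined the tailnet:
#   NODE_IP="$(tailscale ip -4)"
#   is_tailscale_ip "$NODE_IP" || { echo "not a tailnet address: $NODE_IP" >&2; exit 1; }
```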

5. Installing Tailscale

The order is sign up → install → verify.

① Sign up. Log in at login.tailscale.com with an SSO account like Google, GitHub, or Microsoft, and a tailnet for that account is created automatically. There’s no separate signup form; SSO is the signup.

② (For servers) Prepare an auth key. Cloud servers have no browser, so issue an auth key (tskey-…) in advance from the admin console under Settings → Keys. You can skip this if you’ll connect interactively.

③ Install & connect. On each of the two cloud nodes (Amazon Linux 2023):

curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up                 # authenticate via the printed URL (headless: --authkey tskey-… )
tailscale ip -4                   # this node's 100.x address — used directly as --node-ip in ch.6

④ Verify. If both nodes appear in the admin console Machines page (login.tailscale.com/admin/machines) with their 100.x address and hostname, it worked.

You can also check from the node:

tailscale status                  # list of machines in the tailnet + each one's 100.x

With this, the two cloud nodes see each other by 100.x in one tailnet. Now I bring up k3s with these addresses. (Tailscale Linux install)

6. Installing k3s (with Tailscale addresses)

Put the 100.x you got in chapter 5 straight into --node-ip.

Bootstrap & join flow

server-A (8GB):

curl -sfL https://get.k3s.io | K3S_TOKEN=<shared-secret> INSTALL_K3S_VERSION=v1.34.3+k3s1 \
  sh -s - server \
    --cluster-init \
    --node-ip 100.71.x.x \
    --node-external-ip <publicA> \
    --advertise-address 100.71.x.x \
    --flannel-backend vxlan

--cluster-init — initializes embedded etcd as the first server. (server flags)
--node-ip 100.71.x.x — advertises the Tailscale address received in ch.5 as the InternalIP.
--node-external-ip / --advertise-address — public IP (for external exposure), apiserver advertise address (Tailscale).
--flannel-backend vxlan — CNI backend (the default, stated explicitly).

K3S_TOKEN can be a value you choose yourself, like a password, or it can be left unset so that k3s generates one. Either way you need the value when joining, so save it somewhere, or read the generated value from the path below on server-A.

/var/lib/rancher/k3s/server/node-token
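
If you set the token yourself, a random value is safer than something memorable. One way to produce one with standard tools is sketched below; gen_token is a hypothetical helper name, and any sufficiently long random string would do.

```shell
#!/bin/sh
# Generate a random 64-character hex string to use as K3S_TOKEN.
# Reads 32 bytes from /dev/urandom and hex-encodes them with od.
gen_token() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

K3S_TOKEN="$(gen_token)"
echo "token length: ${#K3S_TOKEN}"
```

Store the value somewhere safe before running the installer, since both servers need the same one.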

server-B (16GB) — joins as the second server. This node, too, joins the tailnet first, then just connects with the same token:

curl -sfL https://get.k3s.io | K3S_TOKEN=<shared-secret> INSTALL_K3S_VERSION=v1.34.3+k3s1 \
  sh -s - server \
    --server https://172.26.x.x:6443 \
    --node-ip 100.99.x.x

--server https://172.26.x.x:6443 = server-A’s address (a private IP, since it’s the same VPC).
--node-ip 100.99.x.x = this node’s Tailscale address.

The two Lightsail boxes are in the same AWS VPC, so joining itself used the private IP, but the InternalIP advertised to the cluster is Tailscale (100.x) for both.

Firewall — open only the minimum externally. (requirements)

port	use	exposure
80 / 443	Traefik Ingress	all
22	SSH	my IP only
6443 / 2379-2380 / 8472 / 10250	apiserver·etcd·flannel·kubelet	closed publicly, private/Tailscale internal only

7. Cluster setup — complete with two nodes

Attaching the home iMac as an agent is covered in the next article.

For now I’ve built the cluster with two Lightsail boxes, Tailscale applied. Listing the nodes, you can confirm both are Ready on the same version and runtime.

kubectl get nodes -o wide

NAME           STATUS  ROLES               AGE   VERSION       INTERNAL-IP   EXTERNAL-IP    OS-IMAGE                       KERNEL-VERSION              CONTAINER-RUNTIME
…3-146(8GB)    Ready   control-plane,etcd  139d  v1.34.3+k3s1  100.71.x.x    52.x.x.x       Amazon Linux 2023.7.20250512   6.1.134-…amzn2023.x86_64    containerd://2.1.5-k3s1
…2-70(16GB)    Ready   control-plane,etcd  139d  v1.34.3+k3s1  100.99.x.x    3.x.x.x        Amazon Linux 2023.9.20251105   6.1.156-…amzn2023.x86_64    containerd://2.1.5-k3s1
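
With more nodes, eyeballing the INTERNAL-IP column gets tedious. The sketch below flags any node whose InternalIP is outside Tailscale’s 100.64.0.0/10 range; off_tailnet_nodes is a hypothetical helper, and the sample node names and addresses are made up for the demonstration.

```shell
#!/bin/sh
# Reads "NAME INTERNAL-IP" pairs on stdin and prints the nodes whose
# InternalIP is outside 100.64.0.0/10. Exits non-zero if any is found.
off_tailnet_nodes() {
  awk '{
    split($2, o, ".")
    if (o[1] != 100 || o[2] < 64 || o[2] > 127) { print $1; bad = 1 }
  } END { exit bad }'
}

# Against a live cluster the pairs could come from kubectl's JSONPath output:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' | off_tailnet_nodes

# Made-up sample input: the second node advertises a VPC private IP.
printf '%s\n' 'server-a 100.71.0.1' 'server-b 172.26.0.9' | off_tailnet_nodes \
  || echo "the nodes listed above are not on the tailnet"
```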

Check whether the two nodes are etcd voting members (look at Conditions in kubectl describe node <name>):

Conditions:
  Type             Status   Reason                       Message
  ----             ------   ------                       -------
  EtcdIsVoter      True     MemberNotLearner             Node is a voting member of the etcd cluster
  MemoryPressure   False    KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False    KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False    KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True     KubeletReady                 kubelet is posting ready status
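
The same condition can be read across all nodes at once and compared with the quorum from chapter 3. A minimal sketch, assuming input lines of the form “NAME STATUS” where STATUS is the EtcdIsVoter condition status; count_voters is a hypothetical helper and the sample input stands in for real kubectl output.

```shell
#!/bin/sh
# Reads "NAME STATUS" pairs on stdin, where STATUS is the EtcdIsVoter
# condition status, and prints how many nodes are voting members.
count_voters() {
  awk '$2 == "True" { n++ } END { print n + 0 }'
}

# Against a live cluster the pairs could come from:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="EtcdIsVoter")].status}{"\n"}{end}' | count_voters

# Sample input standing in for the two servers in this article.
voters="$(printf '%s\n' 'server-a True' 'server-b True' | count_voters)"
echo "voters: $voters, quorum: $(( voters / 2 + 1 ))"
```

With two voters the quorum is also two, which is the arithmetic behind “two is not HA”.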

Check that the k3s default bundle came up too (kubectl get pods -n kube-system):

# kubectl get pods -n kube-system  →  k3s default bundle only (excerpt)
coredns-7f496c8d7d-nx9jc                  1/1  Running    139d   # DNS
local-path-provisioner-578895bd58-mgxpm   1/1  Running    139d   # local storage (default SC)
metrics-server-7b9c9c4b9c-76ldg           1/1  Running    139d   # metrics (kubectl top)
traefik-78df465dcc-66kn8                  1/1  Running    9d     # Ingress (server-A)
traefik-78df465dcc-gs4q7                  1/1  Running    8d     # Ingress (server-B) → one per node = 2 replicas
helm-install-traefik-crd-pmk4t            0/1  Completed  139d   # Helm Job that installed the bundle (completed)

That concludes setting up two cloud instances as a k3s cluster. Beyond installing k3s, I also set up Tailscale so that any machine able to run k3s can later join as an agent, wherever it is and whatever form it takes.

8. Next

The AWS Lightsail nodes are now formed into a cluster, and the groundwork for nodes to join is all set.

In the end it came down to one command per node, but this stage took more time than I expected.

To this two-node cluster, I’ll now bring in the iMac resting at home, in earnest. I’ll install Lima VMs on the iMac, create an agent on each, join them to the same tailnet, and write up the problems I ran into after joining — solving them along the way.

References

k3s — What is K3s / Architecture / Datastore: https://docs.k3s.io/ · /architecture · /datastore
k3s — HA Embedded etcd / Server flags / etcd-snapshot / Requirements: https://docs.k3s.io/datastore/ha-embedded · /cli/server · /cli/etcd-snapshot · /installation/requirements
Lightweight distro comparison (k3s·k0s·MicroK8s): https://palark.com/blog/small-local-kubernetes-comparison/ · https://www.portainer.io/blog/k0s-vs-k3s · https://www.nops.io/blog/k0s-vs-k3s-vs-k8s/
Tailscale — Linux install: https://tailscale.com/kb/1031/install-linux
AWS Lightsail — burst CPU / baseline: https://docs.aws.amazon.com/lightsail/latest/userguide/baseline-cpu-performance.html
