Testing out IPVS mode with minikube
Starting from Kubernetes version 1.9, there is a promising new IPVS mode in kube-proxy. One of its advantages is the ability to pick a load-balancing method: round robin (RR), least connection (LC), source/destination hashing, shortest expected delay, plus some variations of those. What I was interested in are the LC methods, as they should handle inbound traffic under load somewhat better than the random backend selection of iptables mode.
Trying it out
So, to use IPVS one should either pass --proxy-mode ipvs to kube-proxy, or set the equivalent option in the config file. As it turned out, minikube doesn’t provide a way to pass extra config to kube-proxy…
Digging deeper into how things are set up inside the minikube VM (as of 0.28.2), I found out that kube-proxy is launched inside its own pod, with its config file mounted from a ConfigMap. So after k8s boots up with the default settings, we’ll just edit the relevant config (set mode: "ipvs" in config.conf):
kubectl edit -n kube-system configmap/kube-proxy
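For a repeatable setup, the same change could be made without an interactive editor. This is only a sketch: set_yaml_scalar is a hypothetical helper (not part of kubectl or minikube) that rewrites "key: value" lines with sed, and it assumes the key shows up only once in the ConfigMap dump. The kubectl pipeline in the trailing comment is untested here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for every stdin line of the form "<indent>KEY: ...",
# replace the value so the line becomes <indent>KEY: "VALUE".
set_yaml_scalar() {
  local key=$1 value=$2
  sed -E "s/^([[:space:]]*${key}:).*/\\1 \"${value}\"/"
}

# Demo on a fragment shaped like kube-proxy's config.conf.
printf '    mode: ""\n    nodePortAddresses: null\n' | set_yaml_scalar mode ipvs

# Against a live cluster it could be wired up like this (not run here):
#   kubectl -n kube-system get configmap kube-proxy -o yaml \
#     | set_yaml_scalar mode ipvs \
#     | kubectl replace -f -
```

The same helper would also cover the scheduler switch later on (set_yaml_scalar scheduler lc), under the same one-occurrence assumption.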
To apply the new configuration, delete the old pod and k8s will create a new one, as required by the corresponding DaemonSet (kc here is a shell alias for kubectl):
$ kc get -n kube-system pods
NAME READY STATUS RESTARTS AGE
...
kube-proxy-49psk 1/1 Running 0 11h
...
$ kc delete -n kube-system po/kube-proxy-49psk
pod "kube-proxy-49psk" deleted
$ kc get -n kube-system pods
NAME READY STATUS RESTARTS AGE
...
kube-proxy-x7qgq 1/1 Running 0 7m
...
Now let’s fire up kube-proxy logs and see how this went:
$ kc logs -n kube-system po/kube-proxy-x7qgq
E0805 09:46:12.625751 1 ipset.go:156] Failed to make sure ip set: &{{KUBE-CLUSTER-IP hash:ip,port inet 1024 65536 0-65535 Kubernetes service cluster ip + port for masquerade purpose} map[] 0xc420562080} exist, error: error creating ipset KUBE-CLUSTER-IP, error: exit status 1
E0805 09:46:42.645604 1 ipset.go:156] Failed to make sure ip set: &{{KUBE-LOAD-BALANCER-FW hash:ip,port inet 1024 65536 0-65535 Kubernetes service load balancer ip + port for load balancer with sourceRange} map[] 0xc420562080} exist, error: error creating ipset KUBE-LOAD-BALANCER-FW, error: exit status 1
E0805 09:47:12.677159 1 ipset.go:156] Failed to make sure ip set: &{{KUBE-NODE-PORT-UDP bitmap:port inet 1024 65536 0-65535 Kubernetes nodeport UDP port for masquerade purpose} map[] 0xc420562080} exist, error: error creating ipset KUBE-NODE-PORT-UDP, error: exit status 1
E0805 09:47:42.748946 1 ipset.go:156] Failed to make sure ip set: &{{KUBE-NODE-PORT-LOCAL-TCP bitmap:port inet 1024 65536 0-65535 Kubernetes nodeport TCP port with externalTrafficPolicy=local} map[] 0xc420562080} exist, error: error creating ipset KUBE-NODE-PORT-LOCAL-TCP, error: exit status 1
Well, that doesn’t look like we’re good to go with IPVS… The problem is that the kernel used in minikube lacks a few ipset-related modules that provide the set types kube-proxy needs in IPVS mode. I’m not aware of a way to add extra modules to minikube’s kernel other than building a custom image, so we need to go deeper…
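To get a compact view of which set types are failing, the log can be boiled down with a bit of sed. A minimal sketch, assuming the log lines keep the exact "Failed to make sure ip set: &{{NAME TYPE ..." shape shown above; failed_ipset_types is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kube-proxy log lines on stdin and print
# "<set type> <set name>" for every ipset that could not be created.
failed_ipset_types() {
  sed -n 's/.*Failed to make sure ip set: &{{\([A-Z0-9-]*\) \([a-z:,]*\) .*/\2 \1/p' | sort -u
}

# Demo on one (shortened) line from the log above.
echo 'E0805 09:46:12.625751 1 ipset.go:156] Failed to make sure ip set: &{{KUBE-CLUSTER-IP hash:ip,port inet 1024 65536 0-65535 ...' \
  | failed_ipset_types

# With a cluster at hand (not run here):
#   kubectl -n kube-system logs po/kube-proxy-x7qgq | failed_ipset_types
```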
Building custom minikube.iso
According to the documentation, this is as simple as clone & make:
$ git clone https://github.com/kubernetes/minikube
$ cd minikube
$ make buildroot-image
$ make out/minikube.iso
Note that you’ll need Docker for this to succeed, as everything is done inside a container (who needs a shitload of build dependencies on their laptop?). Performing these steps on macOS produced a strange error:
--- snip ---
echo 'csu/init-first.o csu/libc-start.o csu/sysdep.o csu/version.o csu/check_fds.o csu/libc-tls.o csu/elf-init.o csu/dso_handle.o csu/errno.o csu/errno-loc.o' > /mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/stamp.oT
mv -f /mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/stamp.oT /mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/stamp.o
echo 'csu/init-first.os csu/libc-start.os csu/sysdep.os csu/version.os csu/check_fds.os csu/dso_handle.os csu/unwind-resume.os csu/errno.os csu/errno-loc.os' > /mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/stamp.osT
echo 'csu/elf-init.oS' > /mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/stamp.oST
mv -f /mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/stamp.osT /mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/stamp.os
mv -f /mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/stamp.oST /mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/stamp.oS
/mnt/out/buildroot/output/host/bin/x86_64-minikube-linux-gnu-gcc -nostdlib -nostartfiles -r -o /mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/crt1.o /mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/start.o /mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/abi-note.o /mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/init.o /mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/static-reloc.o
mv: cannot move '/mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/stamp.oST' to '/mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/stamp.oS': No such file or directory
../o-iterator.mk:9: recipe for target '/mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/stamp.oS' failed
make[5]: *** [/mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build/csu/stamp.oS] Error 1
make[5]: *** Waiting for unfinished jobs....
make[5]: Leaving directory '/mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/csu'
Makefile:215: recipe for target 'csu/subdir_lib' failed
make[4]: *** [csu/subdir_lib] Error 2
make[4]: Leaving directory '/mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e'
Makefile:9: recipe for target 'all' failed
make[3]: *** [all] Error 2
make[3]: Leaving directory '/mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/build'
package/pkg-generic.mk:223: recipe for target '/mnt/out/buildroot/output/build/glibc-glibc-2.27-57-g6c99e37f6fb640a50a3113b2dbee5d5389843c1e/.stamp_built' failed
--- snip ---
At first, these messages don’t make sense: make is unable to move a file it created a few commands ago. But if you look at the previous, similar command, it clicks: Apple uses a case-insensitive FS by default! stamp.osT and stamp.oST are the same file there, so the first mv takes it away and the second one finds nothing to move. Minus 10 points to those who came up with the idea of using case-differing file extensions in a build process, though. To overcome this obstacle, I’ve adopted a trick from one poor soul trying to build a cross-compilation toolchain for RPi 3 - use a case-sensitive disk image as buildroot’s output directory:
$ rm -rf out
$ hdiutil create -type SPARSE -fs 'Case-sensitive Journaled HFS+' -volname buildroot_out -size 20g ./buildroot_out.sparseimage
$ hdiutil attach -mountpoint ./out ./buildroot_out.sparseimage
$ make out/minikube.iso
You need to be careful here, though: depending on how your Docker file sharing is set up, you may need to ssh into the VM and ensure that the correct volume is mounted at out. If it’s NFS, for example, you’ll need to add a line to /etc/exports, restart nfsd, and explicitly mount that dir inside the VM.
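To confirm that the directory the build actually sees is case-sensitive (on the host, or from inside the Docker VM), a small probe may help. This is a sketch under assumptions: is_case_sensitive and the OUT_DIR variable are hypothetical names, and the directory has to be writable.

```shell
#!/usr/bin/env bash
# Hypothetical probe: create a mixed-case temp file in DIR and check whether
# the all-lowercase spelling of its name resolves to it.
# Returns 0 = case-sensitive, 1 = case-insensitive, 2 = could not create a file.
is_case_sensitive() {
  local dir=${1:-.} probe lower
  probe=$(mktemp "$dir/CaseProbe.XXXXXX") || return 2
  lower="$dir/$(basename "$probe" | tr '[:upper:]' '[:lower:]')"
  if [ -e "$lower" ]; then
    rm -f "$probe"
    return 1
  fi
  rm -f "$probe"
  return 0
}

# OUT_DIR is an assumption: point it at the mounted out directory.
if is_case_sensitive "${OUT_DIR:-.}"; then
  echo "case-sensitive"
else
  echo "NOT case-sensitive (or not writable)"
fi
```

Running it with OUT_DIR=./out after hdiutil attach, and once more from inside the build container, shows whether the sparse image is really what buildroot writes to.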
Once the build finishes, we need to tweak the Linux kernel configuration to include the ipset modules. Minikube’s custom ISO documentation says it can be done with make linux-menuconfig, but on macOS that’s not the case (menuconfig won’t start). Also, the kernel version specified in minikube’s Makefile doesn’t match the one used to build the ISO, so you’ll need to correct this as well (the kernel source can be found in out/buildroot/output/build/linux-4*; in my case, it was linux-4.15):
diff --git a/Makefile b/Makefile
index 50f113aab..c3905d62e 100755
--- a/Makefile
+++ b/Makefile
@@ -33,7 +33,7 @@ MINIKUBE_VERSION ?= $(ISO_VERSION)
MINIKUBE_BUCKET ?= minikube/releases
MINIKUBE_UPLOAD_LOCATION := gs://${MINIKUBE_BUCKET}
-KERNEL_VERSION ?= 4.16.14
+KERNEL_VERSION ?= 4.15
GOOS ?= $(shell go env GOOS)
GOARCH ?= $(shell go env GOARCH)
Finally, to configure the kernel we’ll reuse buildroot’s docker image:
$ docker run -it --rm -v $PWD:/minikube gcr.io/k8s-minikube/buildroot-image bash
root@559247e3fa03:/# apt-get update
root@559247e3fa03:/# apt-get install -y --no-install-recommends libncurses-dev
root@559247e3fa03:/# cd /minikube
root@559247e3fa03:/minikube# make linux-menuconfig
Go to Network > Netfilter > IP Sets, tick all hash types as modules, save & exit. Now build the ISO once more (it should be fast). To use the new image, delete the old cluster and recreate it with our new --iso-url:
$ minikube stop && minikube delete
$ minikube start --iso-url=file://$PWD/out/minikube.iso
After the cluster starts up, repeat the earlier steps to switch kube-proxy to IPVS, and voilà:
I0811 21:26:07.996804 1 feature_gate.go:230] feature gates: &{map[]}
I0811 21:26:08.064640 1 server_others.go:183] Using ipvs Proxier.
W0811 21:26:08.086817 1 proxier.go:349] clusterCIDR not specified, unable to distinguish between internal and external traffic
W0811 21:26:08.086847 1 proxier.go:355] IPVS scheduler not specified, use rr by default
I0811 21:26:08.087178 1 server_others.go:210] Tearing down inactive rules.
I0811 21:26:08.142232 1 server.go:448] Version: v1.11.0
I0811 21:26:08.154958 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0811 21:26:08.155260 1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0811 21:26:08.155338 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0811 21:26:08.155394 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I0811 21:26:08.155634 1 config.go:102] Starting endpoints config controller
I0811 21:26:08.155661 1 controller_utils.go:1025] Waiting for caches to sync for endpoints config controller
I0811 21:26:08.155703 1 config.go:202] Starting service config controller
I0811 21:26:08.155709 1 controller_utils.go:1025] Waiting for caches to sync for service config controller
I0811 21:26:08.256254 1 controller_utils.go:1032] Caches are synced for service config controller
I0811 21:26:08.256369 1 controller_utils.go:1032] Caches are synced for endpoints config controller
Now let’s see which ipset modules are really needed (e.g. to submit a patch to minikube), and also check out the IPVS configuration set up by the proxy:
$ kc exec -it -n kube-system kube-proxy-lxj2d -- ipset list | grep Type | sort -u
Type: bitmap:port
Type: hash:ip,port
Type: hash:ip,port,ip
Type: hash:ip,port,net
$ kc exec -it -n kube-system kube-proxy-lxj2d -- ipvsadm
OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "exec: \"ipvsadm\": executable file not found in $PATH": unknown
command terminated with exit code 126
Oh, not again! No ipvsadm in the kube-proxy image… Screw it, just install the package via toolbox:
$ minikube ssh
$ toolbox
[root@minikube ~]# dnf -y install ipvsadm
[root@minikube ~]# ipvsadm
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP minikube:https rr
-> minikube:pcsync-https Masq 1 0 0
TCP minikube:domain rr
-> 172.17.0.2:domain Masq 1 0 0
-> 172.17.0.3:domain Masq 1 0 0
TCP minikube:http rr
-> 172.17.0.4:websm Masq 1 0 0
TCP localhost:ndmps rr
-> 172.17.0.4:websm Masq 1 0 0
TCP minikube:ndmps rr
-> 172.17.0.4:websm Masq 1 0 0
TCP minikube:ndmps rr
-> 172.17.0.4:websm Masq 1 0 0
UDP minikube:domain rr
-> 172.17.0.2:domain Masq 1 0 0
-> 172.17.0.3:domain Masq 1 0 0
As you would’ve guessed by now, this didn’t work on the first try :) dnf failed to fetch package indices due to low disk space. Somehow I’ve only got 1G on rootfs… I worked around this by mounting a tmpfs at /var/cache/dnf.
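I don’t understand all of the output, but there is certainly a DNS service on TCP and UDP. The odd service names are just ipvsadm resolving port numbers through /etc/services (pcsync-https is 8443, websm is 9090, ndmps is 30000); ipvsadm -ln prints plain numbers instead. As you can see, the IPVS scheduler for all the virtual services is rr.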
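Coming back to the ipset types listed earlier: for a patch to minikube's kernel config, they need to be translated into Kconfig option names. The sketch below assumes a simple naming rule (drop the commas, turn the colon into an underscore, upper-case, prefix with CONFIG_IP_SET_), which matches the kernel's option names for the four types seen here; ipset_types_to_kconfig is a hypothetical helper, and the result is worth double-checking in menuconfig.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn "Type: ..." lines from `ipset list` (stdin)
# into the kernel config options that provide those set types.
ipset_types_to_kconfig() {
  sed -n 's/^Type: //p' \
    | tr -d ',\r' \
    | tr ':' '_' \
    | tr '[:lower:]' '[:upper:]' \
    | sed 's/^/CONFIG_IP_SET_/' \
    | sort -u
}

# Demo on the four types reported by kube-proxy's ipset list.
printf 'Type: bitmap:port\nType: hash:ip,port\nType: hash:ip,port,ip\nType: hash:ip,port,net\n' \
  | ipset_types_to_kconfig
```

Note that bitmap:port is on the list too, so ticking only the hash types would not be enough on a kernel that lacks the bitmap one.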
Trying it out, for real
First, I’d like to check the default scheduler. We’ll spin up an HTTP server and bench it with ab.
$ kc run hello-server --image kennethreitz/httpbin --port 80 --replicas 2
$ kc expose deployment hello-server
$ # Wait till it's ready
$ kc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
hello-server ClusterIP 10.104.236.6 <none> 80/TCP 37m
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 3h
$ ab -c 10 -n 100 "http://10.104.236.6/ip"
[root@minikube ~]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
--- snip ---
TCP 10.104.236.6:80 rr
-> 172.17.0.5:80 Masq 1 0 50
-> 172.17.0.6:80 Masq 1 0 50
As expected, requests have landed 50/50 on both pods. Next we’ll change the scheduler to lc. To do this, edit the configmap once again (set scheduler: "lc" under ipvs in config.conf) and delete kube-proxy’s pod. Then, to simulate an uneven load, I came up with this brittle scenario:
- scale the deployment to a single pod
- start a bunch of long requests
- scale it back to two pods
- run the test
$ kc scale deployment hello-server --replicas 1
$ # Wait till the second pod dies
$ (ab -c 10 -n 10 "http://10.104.236.6/delay/10" &) && \
kc scale deployment hello-server --replicas 2 && \
sleep 8 && \
ab -c 10 -n 100 "http://10.104.236.6/ip"
[root@minikube ~]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
--- snip ---
TCP 10.104.236.6:80 lc
-> 172.17.0.5:80 Masq 1 0 10
-> 172.17.0.6:80 Masq 1 0 100
Yay, it worked! While the first pod was “busy” serving delayed requests, the second handled a hundred short ones.
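Why the split comes out so lopsided can be illustrated with a toy model. This is a sketch, not the kernel's code: it assumes two backends, requests arriving one at a time and finishing instantly, and an lc scheduler that ranks backends by active*256 + inactive connections (a simplification of how IPVS weighs active connections far above inactive ones). The simulate function and its parameters are made up for the illustration.

```shell
#!/usr/bin/env bash
# Toy model: simulate SCHED BUSY REQUESTS
#   SCHED    - rr or lc
#   BUSY     - long-running connections already open on backend 0
#   REQUESTS - short requests to distribute
# Prints how many short requests each of the two backends served.
simulate() {
  local sched=$1 busy=$2 requests=$3
  local active=("$busy" 0) inactive=(0 0) served=(0 0)
  local i pick next=0 o0 o1
  for ((i = 0; i < requests; i++)); do
    if [[ $sched == rr ]]; then
      # Round robin: alternate regardless of load.
      pick=$next
      next=$(( (next + 1) % 2 ))
    else
      # Least connection: lowest overhead wins, ties go to backend 0.
      o0=$(( active[0] * 256 + inactive[0] ))
      o1=$(( active[1] * 256 + inactive[1] ))
      if (( o1 < o0 )); then pick=1; else pick=0; fi
    fi
    served[pick]=$(( served[pick] + 1 ))
    # A short request completes at once and is counted as inactive.
    inactive[pick]=$(( inactive[pick] + 1 ))
  done
  echo "${served[0]} ${served[1]}"
}

simulate rr 10 100   # prints: 50 50
simulate lc 10 100   # prints: 0 100
```

With ten long requests parked on the first backend, lc sends all hundred short ones to the second, which lines up with the InActConn column above; rr would have split them evenly no matter what.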
SUCCESS 🎉🎉🎉