How Kubernetes taught me I'm an idiot

9 minute read

I have been a DevSecOps Engineer / Computer Systems Engineer for roughly 12 years now and I love my job. There is always something to learn, including humbling lessons about being pigheaded, and this is one of those times technology got the better of me and made me feel like an idiot.

Background

I have recently upgraded my office to make it a much more suitable WFH environment, as Covid has really eliminated the concept of working from an office for me. Part of that upgrade included a nice fancy new server rack with a 1080 kg weight capacity and 27U of rack space, which is awesome, but the servers I have to put in it are quite antiquated now, being Dell PowerEdge 1950s. I learned very quickly that in the rack they get hot and use 7 tonnes of power per minute, and like everyone I try to be frugal and save money where I can, so I decided I'll only use the 1950s for lab testing and shut them down when not actively testing to save some power and heat in my office. That's great, but what about the services I run, like this site and Discord bots and all those kinds of things I enjoy hosting and working on?

Enter Linode. As a DevOps engineer I do a lot of work with Kubernetes and containers, so I thought, well, I'll just spin up an LKE cluster for $80 a month. Easy peasy: money saved, plenty of capacity to keep me entertained and, as it turns out, more problems than I bargained for.

The Plan

The plan was to just use an out-of-the-box turnkey Kubernetes solution like Linode's LKE platform (which is the solution I chose and am using for this post), with the goal of making the simplest solution I can. I deal with massively complicated Kubernetes environments daily for work, with all sorts of regulatory and business requirements imposed all over the place, but I don't need any of that. KISS (Keep It Simple, Stupid) is what I really needed for my personal use.

Requirements

Kubernetes - A place to run projects I'm working on.

IngressController - I went with Traefik, mainly because I know it best and I really wanted to limit the cluster to just 1 load balancer.

SSL/TLS - What a Nightmare!!

CI/CD - Drone.io

The Beginning

To start with, I logged into my trusty Linode account and provisioned an LKE cluster with 4 nodes, each with 1 vCPU and 4GB of RAM on shared resources. I don't need dedicated resources for my workloads, and I'm happy to wait for compute if needed.

I then deployed Traefik, a relatively easy and well-documented process that I won't touch on too much, as this took less than 5 minutes.

I then reached out to some friends in TheSetkehProject Keybase community (if you haven't joined us, you should) for ideas on what I should do for CI/CD and was presented with two great options, Drone.io and Tekton. I ultimately decided to roll with Drone, as I used it about 7-8 years ago, before the Harness acquisition.

The Fun Begins

Installing the Drone server went smoothly and was painless in the beginning. Then the time came to install the runners so we could actually run some CI/CD jobs. I obviously chose to go down the Kubernetes runners route, as I have a cluster at my disposal. The installation of the deployment was fairly straightforward: RTFM and it works.

The first hurdle was encountered with the runners, though. I had initially planned to install the runners in the drone namespace I created for CI/CD, so it was nice and compact and contained all in one place, but the way Drone interacts with service accounts in Kubernetes made this impractical (there may be a way to fix this, but I spent far too many hours trying to fix it, so I'm not going to poke the beast until I'm significantly bored enough lol). The solution ended up being to put the runners into the default namespace, use the default service account, and attach a cluster role binding to it so they could deploy resources outside of the default namespace.
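As a rough sketch of what that kind of binding looks like, the Python below builds a ClusterRoleBinding for the default service account in the default namespace and prints it as JSON, which kubectl accepts just like YAML. The binding name and the ClusterRole name `drone-runner` are hypothetical placeholders, not the names from the actual manifests, and the ClusterRole itself is assumed to exist already.

```python
import json


def cluster_role_binding(name, role, service_account="default", namespace="default"):
    """Build a ClusterRoleBinding granting the ClusterRole `role` to a service account."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        # ClusterRoleBindings are cluster-scoped, so metadata has no namespace.
        "metadata": {"name": name},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": role,
        },
        "subjects": [
            {
                "kind": "ServiceAccount",
                "name": service_account,
                "namespace": namespace,
            }
        ],
    }


if __name__ == "__main__":
    # "drone-runner-binding" and "drone-runner" are made-up names for illustration.
    binding = cluster_role_binding("drone-runner-binding", "drone-runner")
    print(json.dumps(binding, indent=2))
```

Saved as a script, the output can be piped straight into `kubectl apply -f -`. Binding a cluster-wide role to the default service account is convenient for a personal cluster, but it is a broad grant, so it is worth keeping the ClusterRole's rules as narrow as the pipelines allow.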

You can see the Drone manifest files I used to get it working at this Gist.

With the help of my mate Armageddon, this was figured out and working in around 8 hours or so, with plenty of trial and error.

CI/CD is working now. The next step is to configure a service with CI/CD and deploy it, to establish the model for how to do this moving forward.

The Nightmare Begins

I selected the Oceanus Project to be the guinea pig for this, so I got to writing some code to make it a much more cloud-friendly API. Once done, I created the pipeline to deploy it and the Kubernetes manifests to create the required resources, and for the most part this went without a hitch, but the section title should give some clues to what's coming next.

Just to add some context before we get into the nightmare that unfolded and how Kubernetes taught me I'm an idiot.

I have been using Namecheap as my domain registrar for about 9 years. They have great customer support, a wide variety of services for a good price, and a lot of tools and integrations that make managing the digital assets they provide quite easy. That is also why I did things very manually for DNS and SSL: I own a wildcard cert or 5 and 20-something domains, and I used the Namecheap DNS servers to configure my domains.

Now, I made a decision when I installed Traefik that I would by default use the HTTPS redirect middleware, so it was impossible to ingress into any service on the cluster without SSL/TLS (unless of course you kubectl proxy to a service). I also did not want each app to maintain its own SSL/TLS certs, as that gets expensive and complicated with SSL passthrough. So I began looking into cert-manager / Let's Encrypt, which is considered to be the solution to this issue, and which I have used and still use successfully in Kubernetes clusters in a lot of my professional work. But during the setup of cert-manager I ran into the first massive issue.

setkeh.com stopped resolving. OpenDNS was reporting SERVFAIL when trying to resolve the domain, so I was not able to access setkeh.com from my home network, as I use OpenDNS as my DNS servers, but I had no problems reaching setkeh.com from my cellular 5G connection or when I swapped from OpenDNS to my ISP-provided DNS servers. The knock-on effect is that Let's Encrypt could also no longer resolve setkeh.com when it was trying to do the HTTP-01 challenge to issue the service TLS certificates. No problem: as I said, I own wildcards, so I'll use those temporarily while I figure out the DNS issues with Namecheap support, and conveniently Traefik has a really easy way to do this using Kubernetes secrets.
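For anyone chasing a similar "works on one resolver, fails on another" problem, here is a minimal sketch of how the response code can be read straight from a resolver using only the Python standard library. It sends a bare-bones A record query over UDP and decodes the RCODE from the reply header, where 2 means SERVFAIL. The function names are my own for illustration, and the sketch assumes the resolver answers over plain UDP on port 53.

```python
import socket
import struct

# DNS response codes (RCODE) as defined in RFC 1035; 4 (NOTIMP) is omitted here.
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet asking for an A record."""
    # Header: ID, flags (only "recursion desired" set), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def parse_rcode(response):
    """Return the RCODE name from the low 4 bits of the header flags."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def check(hostname, resolver, timeout=3.0):
    """Ask `resolver` (an IPv4 address) for `hostname` and return the RCODE name."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(hostname), (resolver, 53))
        response, _ = sock.recvfrom(512)
    return parse_rcode(response)
```

Calling `check("setkeh.com", "208.67.222.222")` queries one of the OpenDNS resolvers; running the same call against another resolver's address makes it easy to see whether the failure is specific to one provider.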

So I went ahead and deleted the cert-manager resources and created the secret containing my wildcard certificate's fullchain and private key files, as documented by Traefik, and everything appeared to be working on my end. I could hit https://oceanus.setkeh.com and got no warnings or errors, but when Armageddon tried, his browser reported a trust issue for the website. We spent far too many (6 or so) hours messing with it, trying to fix the certs and resolve the trust problem using SSL checkers and any tool we could find to shed light on the issue, but to no avail. By 4AM this morning Armageddon had given up and gone to bed, and I was smashing my face on the keyboard in resigned fury.
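The secret described above is a standard Kubernetes TLS secret: type `kubernetes.io/tls`, with the certificate chain under `tls.crt` and the private key under `tls.key`, both base64-encoded. The sketch below builds one in Python and prints it as JSON. The secret name `wildcard-tls` and the PEM contents are placeholders for illustration only.

```python
import base64
import json


def tls_secret(name, namespace, fullchain_pem, key_pem):
    """Build a kubernetes.io/tls Secret from PEM-encoded bytes."""

    def encode(raw):
        # Secret "data" values must be base64-encoded strings.
        return base64.b64encode(raw).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "type": "kubernetes.io/tls",
        "metadata": {"name": name, "namespace": namespace},
        "data": {
            "tls.crt": encode(fullchain_pem),  # leaf cert followed by intermediates
            "tls.key": encode(key_pem),
        },
    }


if __name__ == "__main__":
    # Placeholder bytes; in practice these would be read from the real
    # fullchain and private key files.
    fake_chain = b"-----BEGIN CERTIFICATE-----\nplaceholder\n-----END CERTIFICATE-----\n"
    fake_key = b"-----BEGIN PRIVATE KEY-----\nplaceholder\n-----END PRIVATE KEY-----\n"
    print(json.dumps(tls_secret("wildcard-tls", "default", fake_chain, fake_key), indent=2))
```

The same result comes from `kubectl create secret tls`, which takes the certificate and key files directly. Generating the manifest in code is mainly useful when the secret has to be templated in a pipeline.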

Time to Sleep!!!!!!

I woke up about 6 hours later with the plan that I would try swapping my DNS servers from Namecheap to Cloudflare, to see if I could at least fix the DNS issue and come at the SSL issue from a different angle. So I began the nameserver swap and propagation while I waited for the ol' kettle to boil.

The Idiot Revealed

Let me just start by saying that if you don't know about Cloudflare, you should go sign up and play with their services. It's free (and no, this is not sponsored; you will understand in just a minute).

DNS setup and propagation is now complete. Cloudflare has started handling my domain's DNS records and immediately the SERVFAIL issue has disappeared, so I swap away from my ISP DNS server again and crack on. Things are looking up.

I noticed, though, that Cloudflare was also proxying requests for my DNS records, so I did some more digging and found they are able to act as a proxy for the individual records in the domain. That's cool because it does quite a few things, like DDoS protection, load management and a few other things, but it really came into its own when my lightbulb moment happened.

Why not just use the Cloudflare-generated Let's Encrypt certificates, given that by default they were already proxying the requests and enforcing SSL/TLS connections?

Cloudflare solved the client-side TLS issue, but I still had a server-side issue from Cloudflare to Traefik, because my Traefik-configured wildcard was still not working properly. Thankfully, Cloudflare will generate origin server certificates for you and give you the cert and key. So, happily seeing the light at the end of the tunnel, I created an origin cert and promptly updated the Kubernetes secret that contained my wildcard to now contain the origin server cert provided by Cloudflare, and BOOM, immediately all SSL trust issues vanished. I quickly verified this with several of the many hundreds of tools available to check SSL certificate validity and trust.

This solution really is the Ferrari of everything we tried to this point, as I no longer have each service in the cluster deploying 5-8 resources just for SSL. I can now just deploy the resources for the service and its IngressRoute, and Cloudflare does all the other heavy lifting by default, out of the box.
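To give a feel for how small the per-service footprint becomes, here is a hedged sketch of a Traefik IngressRoute that routes a hostname to a service and terminates TLS with a certificate held in a Kubernetes secret. The service name, port, secret name `origin-cert` and the `websecure` entry point are all assumptions for illustration. The CRD API group also depends on the Traefik version: older v2 releases use `traefik.containo.us/v1alpha1`, newer ones use `traefik.io/v1alpha1`, so it is left as a parameter.

```python
import json


def ingress_route(name, namespace, host, service, port, tls_secret,
                  api_version="traefik.io/v1alpha1", entry_point="websecure"):
    """Build a Traefik IngressRoute that serves `host` over TLS."""
    return {
        "apiVersion": api_version,
        "kind": "IngressRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "entryPoints": [entry_point],
            "routes": [
                {
                    # Traefik rule syntax: match requests by Host header.
                    "match": f"Host(`{host}`)",
                    "kind": "Rule",
                    "services": [{"name": service, "port": port}],
                }
            ],
            # Secret of type kubernetes.io/tls in the same namespace.
            "tls": {"secretName": tls_secret},
        },
    }


if __name__ == "__main__":
    # Service name, port and secret name are hypothetical.
    route = ingress_route(
        "oceanus", "default", "oceanus.setkeh.com", "oceanus", 8080, "origin-cert"
    )
    print(json.dumps(route, indent=2))
```

With the certificate living in one shared secret and Cloudflare handling the public-facing side, each new service only needs its own Deployment, Service and a route like this one.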

The Lesson

All this brings us back to the lesson I learned and why I was an idiot. Even with the vast experience I have, and even with the help of friends who also have vast experience, the problem had already been solved, and I didn't need to craft a solution to a problem that was already fixed. I now have the backing of a platform and a huge range of security benefits I would not have implemented in the solution without a very significant time investment that absolutely was not required.

While solving issues like this on your own sounds awesome (and it really does feel good), sometimes the wise path is to use the tools that already exist to solve the problem. Pigheadedness in this case cost me a weekend. It was not all bad, as I got to spend most of the weekend with my good mate Armageddon, but we both probably could have used that time to code some other cool service instead of troubleshooting the bottom rung of the ladder.

The End

Thanks for stopping by. I hope this post and my pain were entertaining. If you ever wanna hang out or ask a question, you're more than welcome to join TheSetkehProject on Keybase or find me on Libera IRC, where I frequent #lebfug, the Lebanese DevOps user group channel run by Armageddon.

Have a great day <3