Sunday, 21 August 2022

AWS EKS 502 503 504 gateway errors and pricing

I have used K8S for quite some time now, and I actually quite enjoy it, but as with every good piece of software, AWS has to come and fuck everything up.



I am using an AWS ALB in my K8S cluster and getting tons of gateway errors during rolling updates or when pods move between nodes. I am going to share my experience of how I handled those errors and what the solutions were for me.


504 Gateway timeout

Issue 1

First, this seems to happen when doing a rolling update. There is a lot of info about it on the internet and GitHub (it seems many people are affected by it), yet nothing official from AWS. I didn't want to just copy-paste code from the internet when I pay a lot of money for AWS and it should work out of the box, so I contacted AWS and provided them with an easily reproducible setup.


And they provided me with the exact solution the internet gave me...


lifecycle:
  preStop:
    exec:
      command: ['/bin/sh', '-c', 'sleep <SLEEP_TIME>']
terminationGracePeriodSeconds: <GRACE_PERIOD_TIME>

The reason we need this lifecycle rule is that the ALB needs to "drain" already-existing connections. For example, if your app can take 30 seconds to handle a request, we need to wait 30 seconds for all the existing connections to drain before we can kill the pod. When the ALB endpoint enters draining mode, the pod should not get any new connections but should keep handling existing ones. This sleep is stupid, but it's what keeps the pod alive during the drain, and it's what AWS recommended.. so yeah.
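To make it concrete, here is a minimal sketch of how this fits into a Deployment. The name, image and the 30/60 second values are hypothetical - pick a sleep roughly as long as your ALB needs to drain connections, and a grace period a bit longer than the sleep:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service                        # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      terminationGracePeriodSeconds: 60   # must be longer than the sleep below
      containers:
        - name: app
          image: my-service:latest        # hypothetical image
          lifecycle:
            preStop:
              exec:
                # keep the pod alive while the ALB target drains existing connections
                command: ['/bin/sh', '-c', 'sleep 30']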


Issue 2

In my case, this was not the only issue. I would still get 502/504 errors when doing a rolling update, and also when pods were moving between nodes. I started by debugging. First, we run wget in a loop to monitor for gateway errors:

# hammer the ALB endpoint in a loop and watch for 5xx responses
for i in {1..10000}; do
    wget -O- https://some-alb-endpoint
done

While running this and killing pods, I could see that I sometimes get 503/504 HTTP errors in the wget responses.

Great.. so it reproduces quite easily, let's fix it.

I started by monitoring the AWS ALB endpoints to see what happens. I killed a pod and observed the endpoint enter draining mode, which makes sense.

I then created a dummy pod nearby with netcat (https://en.wikipedia.org/wiki/Netcat) and tried to netcat the pod while it was terminating. In theory it should work: during the terminating/sleep period, the pod should still be available to handle connections. It should not get any new connections, but existing connections should be kept alive. They are not.. so here is our problem - the pod stops handling connections when it enters the terminating state.

I thought about what could cause this kind of issue - and my usage of Calico (https://www.tigera.io/project-calico/) came to mind. Searching GitHub I found https://github.com/projectcalico/calico/issues/4518, which was exactly what I was facing. Upgrading Calico to a version with the fix solved the issue!


502 Bad Gateway

Looking at the logs, sometimes I would get this error. Cross-referencing the logs, it seems to happen when a node terminates (usually when a spot instance is reclaimed). Looking at which pods run on which node, I can see that sometimes a single node runs all the replicas of a certain service (2 replicas). This can cause a service outage when that node goes down, as the replicas might not be able to become fully ready on another node in a very short time (pulling images, health probes, etc.) - which could explain the issue we are having.

Reading about how K8S distributes workloads, it seems I need at least 3 replicas for K8S to distribute the load between nodes. Being a smart-ass and cheap, and given that we have relatively little traffic and 3 replicas is really too much, I wanted to find a workaround. I found this trick - topology spread constraints:



https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/
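In the pod template it looks roughly like this (a sketch - the app: my-service label is hypothetical, and you may prefer topology.kubernetes.io/zone as the topology key to spread across availability zones instead of nodes):

spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname   # spread across nodes
          whenUnsatisfiable: ScheduleAnyway     # soft constraint - see below
          labelSelector:
            matchLabels:
              app: my-service                   # hypothetical label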

But apparently, it's not enough. If, for example, not enough nodes are healthy, K8S will schedule the pods on the same node anyway (whenUnsatisfiable: ScheduleAnyway), so they end up not being distributed.

For that we have https://github.com/kubernetes-sigs/descheduler, which simply evicts pods that do not meet certain criteria, like the constraint above.
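Here is a rough sketch of the relevant policy, assuming the descheduler's v1alpha1 policy format (newer releases use a profiles-based format, so check the version you deploy):

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingTopologySpreadConstraint":
    enabled: true
    params:
      # also act on soft (ScheduleAnyway) constraints like the one above
      includeSoftConstraints: true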

With those in place - it seems the error is gone, even when a node terminates.

Load balancers are expensive!

Up until this point, I was creating a Load Balancer for every service, which is very expensive. I did not find an answer in the docs out of the box, so I reached out to the local Facebook community and contacted AWS support. They provided me with the following annotation:

alb.ingress.kubernetes.io/group.name

This lets you combine multiple services into one LB, which can save you tons of money if you create an LB per service and have many services.  
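For example, an Ingress that joins a shared ALB could look roughly like this (a sketch - the group name, host and service names are hypothetical, and I assume the AWS Load Balancer Controller v2 with the alb ingress class):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: service-a
  annotations:
    alb.ingress.kubernetes.io/group.name: shared-alb   # all Ingresses with this group share one ALB
spec:
  ingressClassName: alb
  rules:
    - host: a.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: service-a
                port:
                  number: 80

A second Ingress for another service with the same group.name annotation attaches to the same load balancer instead of creating a new one.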

Conclusion


Some of these issues are not related to AWS, and I might be harsh on AWS, but there are a lot of AWS "gotchas" in EKS. Sometimes I am not even sure whether something is a Kubernetes problem or an EKS issue, because of the many AWS issues I keep facing.
EKS is far from the first time I have encountered this phenomenon in AWS.
I had really bad issues with AWS Amplify, hit AWS Lambda cold-start problems, found EMR super slow and insanely difficult to use compared to other solutions, and much much more.
Please also read https://breaking-the-system.blogspot.com/2022/08/why-kubernetes-is-so-important-for.html if you want to know more about my opinions on the matter. 

Thanks for reading and I hope I could help. 

