I wanted to share some tips I learned along the way about K8S and AWS EKS.
- 504 errors when using AWS load balancers. They usually happen during a rolling update. AWS's official response was:
    # on the container:
    lifecycle:
      preStop:
        exec:
          command: ['/bin/sh', '-c', 'sleep <SLEEP_TIME>']
    # on the pod spec:
    terminationGracePeriodSeconds: <GRACE_PERIOD_TIME>
Basically, you need the sleep so that the load balancer has time to drain the connections before the pod shuts down. More info can be found here.
- Verify that your operators/software are up to date. In my case, an old Calico version caused 502/504 errors because of a bug it had. More info can be found here and here.
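Going back to the 504 tip, here is a minimal sketch of where the two settings live in a Deployment. The name `my-app`, the image, the port, and the 30/60 second values are placeholders I picked for illustration, not values from AWS. The one thing the sketch relies on is that the grace period is longer than the sleep, otherwise the container gets killed before the sleep finishes.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # Pod-level setting: total time allowed for the preStop hook plus shutdown.
      terminationGracePeriodSeconds: 60   # placeholder, should exceed the sleep
      containers:
        - name: my-app
          image: my-app:1.0.0             # placeholder image, must contain /bin/sh
          ports:
            - containerPort: 8080
          lifecycle:
            # Container-level hook: runs before the container receives SIGTERM.
            preStop:
              exec:
                command: ['/bin/sh', '-c', 'sleep 30']   # placeholder sleep
```

The sleep keeps the old pod serving while the load balancer deregisters it, so tune both numbers to your load balancer's deregistration delay.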
- Balance highly available pods over multiple nodes. Running 2 or 3 replicas on the same node won't make a service highly available if that node crashes. Use k8s tools and the descheduler to achieve this. You can read more about it here.
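The tip above doesn't name a specific k8s tool, so take this as one possible sketch: topology spread constraints (pod anti-affinity is another option). The `my-app` name, labels, and image are placeholders. The constraint asks the scheduler to keep the replica count per node within 1 of each other.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                          # at most 1 replica difference between nodes
          topologyKey: kubernetes.io/hostname # spread across nodes
          whenUnsatisfiable: DoNotSchedule    # keep the pod pending rather than stacking it
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: my-app
          image: my-app:1.0.0                 # placeholder image
```

The scheduler only checks this when a pod is first placed, which is where the descheduler helps: it can evict pods later so they get rescheduled more evenly.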
- Use load balancers with care. Every load balancer is very expensive. AWS allows you to group multiple ingresses into one load balancer using the annotation "alb.ingress.kubernetes.io/group.name: my-group". You can read about it here and here.
- Don't use CPU limits. Use memory limits, but don't use CPU limits!
The basics are that CPU limits are not ideal (because of how the Linux kernel implements them), and CPU limits do not tell K8S how to split CPU between pods, which can lead to a scenario where one pod starves the node's CPU and other pods time out or die. Using CPU requests allows K8S to prioritize CPU between pods without causing starvation. If a pod does not have CPU requests, you have to assume it will get starved to death!
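As a rough sketch of what that looks like in a manifest (the name, image, and numbers are placeholders, so size them for your own workload): set a CPU request, set memory requests and limits, and leave the CPU limit out.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                 # hypothetical name
spec:
  containers:
    - name: my-app
      image: my-app:1.0.0      # placeholder image
      resources:
        requests:
          cpu: 250m            # placeholder: this pod's guaranteed share under contention
          memory: 256Mi
        limits:
          memory: 256Mi        # memory limit is kept
          # no cpu limit on purpose: the pod may use idle CPU on the node
```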
I had to find this out the hard way! You can read this as well.
- Use a log management system, like Fluentd + ELK, so you can have all your logs and alerts across all services.
- Namespaces do not provide network isolation. Use Calico + Network Policies to prevent network connections between namespaces and allow only specific services, like the logging system.
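Here is a minimal sketch of such a policy. The namespace names `team-a` and `logging` are made up for the example, and I'm assuming the logging system needs to reach the pods (if your pods push logs out instead, you would write an egress rule). It only takes effect when the cluster's network plugin enforces Network Policies, which is the reason for Calico in the tip above.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-and-logging   # hypothetical name
  namespace: team-a                  # hypothetical namespace being protected
spec:
  podSelector: {}                    # applies to every pod in team-a
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Traffic from pods in the same namespace is allowed.
        - podSelector: {}
        # Traffic from the logging namespace is allowed.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: logging
```

Once a pod is selected by a policy like this, ingress from any other namespace is dropped.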
- Use CI/CD. It can be ArgoCD or any other tool. I used https://github.com/actions/actions-runner-controller
- Upgrade K8S regularly, although it's a nightmare. You can use a staging env, Pluto, and kube-no-trouble. AWS components such as aws-load-balancer-controller, the autoscaler, and more will probably break. Read their instructions and their compatibility notes before the upgrade. Read other AWS docs as well. For example, CDK does not really support new versions of EKS: https://github.com/aws/aws-cdk/issues/19843
- Use your IDE to manage the cluster (using a K8S extension). In many cases it's so much faster. I use VSCode, and IMO its K8S extension is better than the IDEA one.
- Use Minikube for everything you can in the development cycle. Don't deploy to staging/prod and wait for things to die. Test them locally. Minikube is a great tool, and it lets you test many things before deploying. This is a huge advantage compared to other cloud services, so use it.