Wednesday, 11 January 2023

k8s and EKS tips I gathered so far

I wanted to share some tips I learned along the way about K8s and AWS EKS.

  1.  504 errors when using AWS load balancers. These usually happen during a rolling update. AWS's official recommendation was to add a preStop sleep and a matching termination grace period:

     ```yaml
     # In the container spec:
     lifecycle:
       preStop:
         exec:
           command: ['/bin/sh', '-c', 'sleep <SLEEP_TIME>']
     # In the pod spec:
     terminationGracePeriodSeconds: <GRACE_PERIOD_TIME>
     ```

     Basically, you need the sleep so the load balancer has time to drain existing connections before the pod shuts down. More info can be found here.

  2. Verify that your operators/software are up to date. In my case, an old Calico version caused 502/504 errors because of a bug it had. More info can be found here and here.
  3. Balance highly available pods over multiple nodes. Running 2 or 3 replicas on the same node won't make the workload highly available if that node crashes. Use K8s scheduling tools and the descheduler to achieve it. You can read more about it here.
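     As a minimal sketch, a topology spread constraint in the pod template keeps replicas on different nodes (the `app: web` label is illustrative):

     ```yaml
     # Pod template fragment: spread matching replicas across nodes.
     spec:
       topologySpreadConstraints:
         - maxSkew: 1                          # at most 1 replica difference between nodes
           topologyKey: kubernetes.io/hostname # spread by node
           whenUnsatisfiable: DoNotSchedule    # hard requirement, not best-effort
           labelSelector:
             matchLabels:
               app: web
     ```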
  4. Use load balancers with care. Every load balancer is very expensive. AWS allows you to group multiple Ingresses into one load balancer using an IngressGroup (a shared group name such as "my-group"). You can read about it here and here.
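     A sketch of one such Ingress, using the aws-load-balancer-controller `group.name` annotation; every Ingress carrying the same group name is merged into the same ALB (names and hosts here are illustrative):

     ```yaml
     apiVersion: networking.k8s.io/v1
     kind: Ingress
     metadata:
       name: service-a                 # illustrative name
       annotations:
         kubernetes.io/ingress.class: alb
         alb.ingress.kubernetes.io/group.name: my-group  # shared ALB group
     spec:
       rules:
         - host: a.example.com         # illustrative host
           http:
             paths:
               - path: /
                 pathType: Prefix
                 backend:
                   service:
                     name: service-a
                     port:
                       number: 80
     ```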

  5. Don't use CPU limits. Use memory limits, but don't use CPU limits!
    The basics: CPU limits are not ideal (because of how the Linux kernel implements them), and a CPU limit does not tell K8s how to split CPU between pods, so one pod can starve the node's CPU while other pods time out or die. CPU requests, on the other hand, let K8s prioritize CPU between pods without causing starvation. If a pod has no CPU request, you have to assume it will get starved to death.
    I had to find this out the hard way! You can read this as well.
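    Following this advice, a container resources block looks like this (values are illustrative):

    ```yaml
    resources:
      requests:
        cpu: 250m          # lets the scheduler place the pod and gives it a fair CPU share
        memory: 256Mi
      limits:
        memory: 256Mi      # cap memory so a leak OOM-kills the pod, not the node
        # deliberately no cpu limit: avoids kernel CPU throttling latency spikes
    ```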
  6. Use a log management system, like Fluentd + ELK, so you have all your logs and alerts across all services in one place.
  7. Namespaces do not provide network isolation. Use Calico + NetworkPolicies to block connections between namespaces while allowing only specific services, such as the logging system.
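    A sketch of this pattern: deny all ingress in a namespace, then allow traffic only from the logging namespace (the `my-app` namespace and the `logging` label are illustrative):

    ```yaml
    # Default-deny ingress for every pod in the namespace.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: default-deny-ingress
      namespace: my-app          # illustrative namespace
    spec:
      podSelector: {}            # empty selector = all pods in the namespace
      policyTypes: [Ingress]
    ---
    # Carve out an exception for the logging system's namespace.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-from-logging
      namespace: my-app
    spec:
      podSelector: {}
      ingress:
        - from:
            - namespaceSelector:
                matchLabels:
                  kubernetes.io/metadata.name: logging  # illustrative namespace name
    ```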
  8. Use CI/CD - it can be ArgoCD or any other tool. I used ArgoCD.
  9. Upgrade K8s regularly, although it's a nightmare. You can use a staging env, Pluto, and kube-no-trouble. AWS components are likely to break, like aws-load-balancer-controller, the autoscaler, and more. Read their instructions and compatibility matrices before the upgrade. Read other AWS docs as well; for example, CDK does not really support new versions of EKS.
  10. Use your IDE to manage the cluster (via a K8s extension). It is often much faster. I use VSCode, and IMO its K8s extension is better than the IDEA one.
  11. Use Minikube for everything you can in the development cycle. Don't deploy to staging/prod and wait for things to die; test locally first. Minikube is a great tool that lets you try many things before deploying. This is a huge advantage compared to other cloud services, so use it.
