AKS is the managed Kubernetes service from Azure. When you create an AKS cluster, Azure creates and operates the Kubernetes control plane for you at no cost. As a user, you only specify how many worker nodes you'd like, plus the other configurations we'll see in this post. So, with that in mind, how can you improve the performance of an AKS cluster when Azure manages almost everything for you?
This post is going to cover critical considerations to improve AKS cluster performance of your workloads in Kubernetes, including the nuances on how Azure helps you do this. Let’s dive in.
You will configure the cluster and the application based on the workloads you want to run in Kubernetes. For example, if your workload is network-intensive, you'd focus on giving the cluster better network throughput and lower latency. Or, if your workload is stateful, you'd focus on the storage options you configure in the cluster. I'll break down each of the common configurations and considerations you'll use to run your workloads in Kubernetes smoothly.
AKS cluster performance is not something you care about only once; it's a continuous improvement process that depends on the feedback you collect. In Kubernetes, the recommended way to understand the resource usage and performance of your applications is cAdvisor. You can install cAdvisor as a DaemonSet in Kubernetes so it collects metrics on each worker node of the cluster.
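As a rough sketch of what that looks like, the manifest below runs one cAdvisor pod per worker node. The image tag, namespace, and host mount paths are illustrative assumptions; check the cAdvisor project docs for the recommended deployment for your cluster version.

```yaml
# Hypothetical DaemonSet: one cAdvisor pod per node (illustrative values).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cadvisor
  template:
    metadata:
      labels:
        app: cadvisor
    spec:
      containers:
      - name: cadvisor
        image: gcr.io/cadvisor/cadvisor:v0.47.2  # pin a tag appropriate for your cluster
        ports:
        - containerPort: 8080                    # cAdvisor's web UI and metrics endpoint
        volumeMounts:                            # read-only views of the host filesystem
        - name: rootfs
          mountPath: /rootfs
          readOnly: true
      volumes:
      - name: rootfs
        hostPath:
          path: /
```

Because it's a DaemonSet, Kubernetes automatically adds a cAdvisor pod to any new node that joins the cluster, so your metrics coverage scales with the cluster itself.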
Let’s look at a few categories where you can do something to improve your application’s performance.
Configuring requests and limits for your pods is going to help the scheduler orchestrate your workloads more efficiently. Requests and limits are the numbers Kubernetes uses to control resources in the cluster, such as CPU and memory. It's not a good practice to deploy pods into Kubernetes without specifying how many resources the pod will need. When you configure a request in a pod, you're telling Kubernetes the minimum amount of resources the pod needs, and Kubernetes will schedule it accordingly. Limits are the maximum amount of resources Kubernetes allows the pod to consume before throttling or restricting it.
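Here's a minimal sketch of what that looks like in a pod spec. The pod name, image, and values are illustrative only; you'd base the real numbers on what you measure for your workload.

```yaml
# Example pod with CPU/memory requests and limits (illustrative values).
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
  - name: web-app
    image: nginx:1.25
    resources:
      requests:
        cpu: 250m       # scheduler reserves at least a quarter of a core
        memory: 256Mi   # scheduler reserves at least 256 MiB
      limits:
        cpu: 500m       # container is throttled above half a core
        memory: 512Mi   # container is OOM-killed if it exceeds this
```

Note the asymmetry: exceeding a CPU limit only throttles the container, while exceeding a memory limit terminates it, so memory limits deserve extra headroom.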
If pods don't come with requests and limits, you can enforce resource controls at the namespace level, which is useful when sharing the cluster among different teams or applications. ResourceQuota is the object you create to cap the total requests and limits for all the pods in a specific namespace. If instead you'd like to apply default values to containers in pods, you can use the LimitRange object.
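Those two objects can be sketched together like this for a hypothetical "team-a" namespace (the name and all numbers are assumptions for illustration):

```yaml
# Hypothetical quota capping total resources in the "team-a" namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"       # sum of all pod CPU requests may not exceed 4 cores
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
---
# Defaults applied to containers that don't declare their own values.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
  - type: Container
    defaultRequest:         # injected as the request when none is set
      cpu: 100m
      memory: 128Mi
    default:                # injected as the limit when none is set
      cpu: 500m
      memory: 512Mi
```

With both in place, a pod deployed without resource settings still gets sensible defaults, and no single team can starve the rest of the cluster.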
This is a big topic, so feel free to discover more on your own. You can find more information from the official AKS docs. And there’s also a good video from Google that explains how requests and limits work in pods.
Once you've identified and defined how many resources each pod will need, it's time to do the math and determine how many worker nodes you'll need in the cluster. It's better to choose the smallest node size that fits your workloads, without going to the extreme of picking nodes that are too small. A node also shouldn't be too big, because when you scale out just to schedule a few more pods, you'll waste resources and money.
You can then flag your nodes to dedicate them to specific workloads. For example, you can use node affinity to schedule pods on a node that has SSD storage, or to co-schedule pods on the same node. Or you can configure taints on the nodes, with matching tolerations on pods, to keep pods off certain nodes. For example, you could dedicate some nodes in the cluster to front-end applications and others to back-end applications.
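Both techniques can be sketched as follows. The label key `disktype`, the taint key `workload`, and the pod names are hypothetical; you'd pick your own naming scheme.

```yaml
# 1) Node affinity: only schedule this pod on nodes labeled disktype=ssd
#    (apply the label with: kubectl label nodes <node> disktype=ssd).
apiVersion: v1
kind: Pod
metadata:
  name: fast-io-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]
  containers:
  - name: app
    image: nginx:1.25
---
# 2) Toleration: this pod may land on nodes tainted for back-end work
#    (apply the taint with: kubectl taint nodes <node> workload=backend:NoSchedule).
apiVersion: v1
kind: Pod
metadata:
  name: backend-app
spec:
  tolerations:
  - key: workload
    operator: Equal
    value: backend
    effect: NoSchedule
  containers:
  - name: app
    image: nginx:1.25
```

The difference is the direction of the constraint: affinity attracts pods to labeled nodes, while a taint repels every pod that doesn't explicitly tolerate it.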
Currently, AKS is working on allowing you to have multiple node pools for the same cluster. This will let you have a node pool with GPUs, and another node pool with fewer resources for non-critical workloads.
A Kubernetes cluster should be near where your customers are, even if you operate the cluster from a different location. If you have customers in multiple locations, it's recommended that you run a cluster in each location. This type of architecture lets you not only reduce latency, but also switch traffic in case of a region failure. In Azure, the best option is to choose two paired regions, which are regions physically close to each other. If a failure occurs, Azure prioritizes the recovery of one region in each pair, and coordinates platform maintenance so that both regions in a pair aren't updated at the same time.
Traffic Manager is the Azure service that helps you route traffic between different AKS clusters. It can route traffic based on latency, geography, or endpoint failure. Users hit a DNS endpoint that resolves through Traffic Manager, which returns the AKS endpoint the user then connects to directly.
When you have clusters in multiple regions, you’ll need to replicate data near the cluster—for example, the container images repositories, data volumes, or databases. You can find more information about this topic in the AKS official docs.
There are two ways to configure networking in AKS: basic networking (kubenet) and advanced networking (Azure CNI).
If you want to connect the AKS cluster to existing resources, either in Azure or on premises, choose the advanced option. Otherwise, with the basic model, you'll need to create routes to connect to other networks, which reduces network performance and can lead to a complicated configuration.
Furthermore, make sure that the subnet assigned to the AKS cluster doesn't overlap with any other network range in your organization. The address space also needs to be large enough, because with advanced networking each pod gets an IP address from the subnet. As AKS creates more pods, more IP addresses are required. For example, a cluster that scales to 50 nodes with the default maximum of 30 pods per node needs roughly 1,500 addresses, so plan accordingly to avoid problems with your application workloads.
You might also consider using native ingress resources and controllers instead of the regular Azure load balancer. Why? Azure load balancers work at layer 4, while ingress works at layer 7, which means an ingress can terminate SSL certificates, route requests based on paths, and expose multiple services through a single IP address.
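As a sketch of what that buys you, the ingress below terminates TLS and fans out two paths to different services behind one public IP. It assumes an ingress controller (for example, NGINX) is installed in the cluster, and the host name, secret name, and service names are all hypothetical.

```yaml
# Hypothetical ingress: TLS termination plus path-based routing on one IP.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  ingressClassName: nginx          # assumes the NGINX ingress controller
  tls:
  - hosts: ["myapp.example.com"]
    secretName: tls-cert           # assumed pre-created TLS secret
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /api                 # /api/* goes to the back-end service
        pathType: Prefix
        backend:
          service:
            name: api-svc
            port:
              number: 80
      - path: /                    # everything else goes to the front end
        pathType: Prefix
        backend:
          service:
            name: frontend-svc
            port:
              number: 80
```

A layer 4 load balancer can't make any of these decisions, because it never looks past the TCP connection to the HTTP request inside it.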
You can find more information about how to configure networks properly from a security perspective in the AKS official docs site.
Even though your workloads might be stateless and you don't need to configure volumes, having a suitable storage type will still help improve AKS cluster performance; pulling images from the container registry is one example.
For production environments, use SSD storage. And if you need concurrent access from multiple pods, use a network storage type. In Azure, these storage types translate into Azure Files, Azure managed disks (SSD), dysk (preview), or blobfuse (preview).
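For the common case, a claim against AKS's built-in `managed-premium` storage class (Premium SSD managed disks) can be sketched like this. The claim name and size are illustrative; for concurrent access from multiple pods you'd use an Azure Files-backed class with `ReadWriteMany` instead.

```yaml
# Hypothetical claim using the built-in managed-premium class (Premium SSD).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]     # single-node attach; Azure Files for RWX
  storageClassName: managed-premium  # AKS built-in Premium SSD managed disks
  resources:
    requests:
      storage: 32Gi
```

A pod then references the claim by name under `spec.volumes`, and Azure provisions and attaches the disk to whichever node the pod lands on.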
Beware that each node has a limit on how many disks it can have attached. Furthermore, the node size can determine the storage performance you get in the cluster; CPU and memory are not the only resource types to consider when choosing a node size. There are VM sizes with the same CPU and memory that offer different storage performance. In Azure, one example is the Standard B2ms versus the Standard DS2 v2 node types. You can see this information in more detail on the Azure VM docs site.
If you want to learn more about storage best practices for AKS, take a look at the official docs site.
There are no magic formulas when talking about performance; you might already be implementing many of the configurations I discussed in this post. And that’s okay. You’ll always want to improve performance, and then keep measuring because there’s still going to be room for improvement, especially because there are topics that I didn’t cover in detail. But check out the links provided, where you can learn more about each topic.
If you want to learn more not just about how to improve the performance of your workloads in Kubernetes but also about recommended practices, give the official AKS site a look. Google also has a video series for Kubernetes best practices. In addition, you can get more insights into your application’s performance by profiling the code, centralizing logs, and tracking errors and metrics with a tool like Retrace.
Lastly, don’t take my words as final. Give each configuration a try and confirm that it works for your workloads.