At Banzai Cloud we are building a managed Cloud Native application and devops platform called Pipeline. Pipeline supercharges the development, deployment and scaling of container-based applications with native support for multi- and hybrid-cloud environments.
The Pipeline platform provides support for advanced scheduling, which enables enterprises to run workloads efficiently by scheduling them to nodes that meet each workload's needs (e.g. CPU, memory, network, IO, spot price).
Pipeline sources infrastructure and price attributes from CloudInfo and automatically labels corresponding nodes with this information. Besides these automatically set labels, users can supply their own labels to be placed onto nodes. With the use of these node labels, in conjunction with node selectors or node affinity/anti-affinity, Kubernetes can be instructed to schedule workloads to the appropriate nodes for optimal resource utilization, stability and performance.
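To make the mechanism concrete, here is a minimal Python sketch of how a Kubernetes nodeSelector is evaluated against a node's labels: every key/value pair in the selector must match exactly. The label keys below are illustrative, not the exact keys Pipeline sets.

```python
def matches_node_selector(node_labels: dict, node_selector: dict) -> bool:
    """A nodeSelector matches a node only if every required
    key/value pair is present on the node with an exact value match."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())

# Labels as Pipeline might set them from CloudInfo (illustrative keys)
node_labels = {
    "node.banzaicloud.io/instanceTypeCategory": "Memory_optimized",
    "node.banzaicloud.io/ondemand": "false",
    "failure-domain.beta.kubernetes.io/zone": "eu-west-1a",
}

# A pod asking for spot/preemptible capacity would match this node
print(matches_node_selector(node_labels, {"node.banzaicloud.io/ondemand": "false"}))  # True
```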
If you’d like to see an example of how node affinity/anti-affinity works, check out our Taints and tolerations, pod and node affinities demystified post. If you are interested in learning how we schedule workloads to spot instances, you should consider reading our Controlling the scheduling of pods on spot instance clusters post.
In this post, we’ll describe how users can provide their own labels for nodes via Pipeline, as well as what goes on behind the scenes during this process.
Users can set node labels at the node pool level; each node in a node pool may have both user-provided labels and the labels automatically set by Pipeline and Kubernetes.
A node pool is a subset of node instances within a cluster that share the same configuration. A cluster can contain multiple node pools, and therefore heterogeneous nodes/configurations. The Pipeline platform is capable of managing any number of node pools in a Kubernetes cluster, each with a different configuration - e.g. node pool 1 uses local SSDs, node pool 2 is spot- or preemptible-based, node pool 3 contains GPUs - and these configurations are turned into actual cloud-specific instances.
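As a rough sketch of this idea, the snippet below models a few node pool definitions and derives the labels each pool's nodes would carry. The pool attributes and the automatic label key are hypothetical, chosen only to illustrate the shape of the data.

```python
# Hypothetical node pool definitions, loosely mirroring a multi-pool cluster:
# one local-SSD pool, one spot pool, one GPU pool.
node_pools = {
    "pool1": {"instanceType": "c5d.xlarge", "labels": {"disk": "local-ssd"}},
    "pool2": {"instanceType": "m4.xlarge", "spotPrice": "0.08",
              "labels": {"lifecycle": "spot"}},
    "pool3": {"instanceType": "p2.xlarge", "labels": {"accelerator": "gpu"}},
}

def effective_labels(pool_name: str) -> dict:
    """Merge an automatically derived pool-name label (illustrative key)
    with the user-provided labels of that pool."""
    auto = {"nodepool.banzaicloud.io/name": pool_name}
    return {**auto, **node_pools[pool_name]["labels"]}

print(effective_labels("pool2"))
```

Every node provisioned for a pool would receive that pool's merged label set, which is what makes pool-aware scheduling possible later.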
We apply the concept of node pools across all of the cloud providers we support in order to create heterogeneous clusters, even on providers that do not support heterogeneous clusters by default (e.g. Azure, Alibaba).
We have open-sourced a Nodepool Labels operator to ease labeling multiple Kubernetes nodes within a pool/cluster.
Users are not allowed to specify labels that conflict with the node labels used by Kubernetes and Pipeline. This is enforced by validating user-provided labels against a set of configured reserved label prefixes, such as k8s.io/, kubernetes.io/, banzaicloud.io/, etc. We encourage users to follow the convention of omitting prefixes, since unprefixed labels are presumed to be private to the user, as stated in the Kubernetes documentation.
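A minimal sketch of this validation step might look like the following. The reserved prefix set is an example; the actual configured set in Pipeline may differ.

```python
RESERVED_PREFIXES = ("k8s.io", "kubernetes.io", "banzaicloud.io")  # example set

def validate_user_labels(labels: dict) -> list:
    """Return the label keys that collide with a reserved prefix.

    A label key's prefix is everything before the '/'. A reserved
    domain also covers its subdomains (e.g. node.kubernetes.io),
    while unprefixed keys are private to the user and always allowed.
    """
    invalid = []
    for key in labels:
        if "/" not in key:
            continue  # unprefixed labels are fine
        prefix = key.split("/", 1)[0]
        if any(prefix == r or prefix.endswith("." + r) for r in RESERVED_PREFIXES):
            invalid.append(key)
    return invalid

print(validate_user_labels({"team": "dev", "kubernetes.io/hostname": "n1"}))
# ['kubernetes.io/hostname']
```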
The following diagram illustrates the node labeling flow during cluster creation:
Node Pool Upscale
ASPECTS THAT NEED TO BE CONSIDERED WHEN SETTING LABELS ON NODES
Labels must be set prior to workloads being scheduled
As of this writing, Kubernetes (1.13) supports only two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. The IgnoredDuringExecution part of these names means that if a node's labels change at runtime such that a pod's affinity rules are no longer satisfied, the pod will still continue to run on that node. This is a problem. Imagine using node anti-affinity to keep certain pods off nodes with specific labels: such a pod might be scheduled to an undesired node if that node is not labelled prior to scheduling.
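The race described above can be sketched in a few lines. The helper below evaluates a single nodeAffinity match expression (In / NotIn only), and the label key in the example rule is illustrative:

```python
def satisfies(match_expr: tuple, node_labels: dict) -> bool:
    """Evaluate one nodeAffinity match expression of the form
    (key, operator, values), supporting only In and NotIn."""
    key, op, values = match_expr
    value = node_labels.get(key)
    return (value in values) if op == "In" else (value not in values)

# Anti-affinity-style rule: keep this pod off spot nodes (illustrative key)
rule = ("node.banzaicloud.io/ondemand", "NotIn", ["false"])

node = {}  # the node's labels have not been applied yet at scheduling time
assert satisfies(rule, node)  # True: the scheduler happily places the pod here

node["node.banzaicloud.io/ondemand"] = "false"  # label arrives too late
assert not satisfies(rule, node)  # the rule is now violated, but with
# IgnoredDuringExecution the already-running pod is NOT evicted
```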
If we have full control over how nodes are provisioned and configured, we can pass labels to the kubelet via its configuration (e.g. the --node-labels flag). This ensures that labels are already present on the node at scheduling time. Also, some cloud providers (e.g. Google) allow nodes to be supplied with labels during cluster creation. Let's call these pre-configured labels. When this option is not available, we must resort to using the Kubernetes API to set labels on nodes. In such cases, we have to ensure that no workloads are deployed to the cluster until the node labels have been set.
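The gating step at the end can be sketched as a simple readiness check: deployments are held back until every node in the cluster carries the expected labels. The node representation below is a simplified stand-in for what a Kubernetes client would return.

```python
def all_nodes_labelled(nodes: list, required_keys: set) -> bool:
    """True once every node carries all of the required label keys.
    Deployments to the cluster should be blocked until this holds."""
    return all(required_keys <= set(n.get("labels", {})) for n in nodes)

nodes = [
    {"name": "n1", "labels": {"pool": "gpu"}},
    {"name": "n2", "labels": {}},  # labeling still in progress
]

print(all_nodes_labelled(nodes, {"pool"}))  # False: n2 is not labelled yet
```

In practice this check would be run in a polling or watch loop against the Kubernetes API before any workload deployment is allowed to proceed.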
Unfortunately, we’re not out of the woods yet. There are other potentially serious scenarios worth anticipating.