What is Kubernetes?
Kubernetes is an open-source platform that automates the deployment, scaling, and management of containerized applications. Thanks to its platform-agnostic nature, it supports a wide range of workloads, whether they run on-premises, in the cloud, or in a combination of the two.
Every organization has its own unique requirements and constraints, but some common patterns emerge. A typical setup involves both a staging and a production cluster, each with a single node pool. We often use namespaces to separate environments, but because we want to test Kubernetes upgrades before applying them to production, we prefer at least two clusters per client.
To give an idea of what this looks like, here is a simplified setup for a fictional client called Y Corp on an arbitrary cloud provider:
As you can see, the clusters differ in their (minimum) node counts and node sizes. At the same time, we want to keep the clusters as similar as possible to avoid configuration drift.
Given our team size, we need to be able to manage many clusters efficiently. After exploring several alternatives, including Terraform, we finally settled on Flux. Flux is a suite of tools that enables resource management through a declarative GitOps approach.
This has several advantages:
- Version control: changes are tracked in Git, allowing us to easily audit, review, and revert them.
- Single source of truth: we can manage all clusters, regardless of cloud provider or on-premises origin, from a single repository.
- Reproducibility: we can easily configure new clusters from scratch, based on existing ones.
Our core Flux configuration, which contains the configuration for all clusters, looks roughly like this:
```
├── clusters
│   ├── y-corp-staging
│   │   ├── cert-manager
│   │   │   ├── kustomization.yml
│   │   │   ├── namespace.yml
│   │   │   └── release.yml
│   │   └── flux-system
│   │       ├── kustomization.yml
│   │       ├── namespace.yml
│   │       └── release.yml
│   ├── y-corp-production
│   │   ├── cert-manager
│   │   │   ├── kustomization.yml
│   │   │   ├── namespace.yml
│   │   │   └── release.yml
│   │   └── flux-system
│   │       ├── kustomization.yml
│   │       ├── namespace.yml
│   │       └── release.yml
│   └── another-cluster..
```
Each module has its own kustomization file and installs one or more Helm releases. Being able to install Helm charts declaratively is a huge advantage, as it allows us to easily upgrade and roll back releases across multiple clusters at once.
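To make this concrete, here is a sketch of what a `release.yml` for the cert-manager module might contain. The repository name, chart version, and values are illustrative assumptions, not our exact configuration:

```yaml
# release.yml — illustrative HelmRelease for cert-manager.
# Chart version and values are placeholders, not our real setup.
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: jetstack
  namespace: cert-manager
spec:
  interval: 1h
  url: https://charts.jetstack.io
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: cert-manager
spec:
  interval: 10m
  chart:
    spec:
      chart: cert-manager
      version: "1.x"              # pin a real version in practice
      sourceRef:
        kind: HelmRepository
        name: jetstack
  values:
    installCRDs: true
```

Upgrading or rolling back a release across clusters then amounts to changing the `version` field in Git and letting Flux reconcile each cluster.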
Some components in our Flux configuration require secrets, such as an API key for a monitoring solution. Emphasizing security by design, we crafted our Flux setup so that even if it were publicly accessible, no sensitive data would be exposed. Consequently, storing plain text secrets in Git is not an option for us even though the repository is private.
While there are many ways to store secrets securely in a git repository, we opted for sealed-secrets.
A key reason is that a sealed secret can only be decrypted by the controller in the target cluster and nobody else, not even by its creator.
Managing secrets this way is easy: we create a SealedSecret resource in the same repository, and it is automatically decrypted and applied to the cluster.
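For illustration, this is roughly the shape of a SealedSecret as committed to Git. The names and namespace are hypothetical, and the ciphertext is a truncated placeholder:

```yaml
# Illustrative SealedSecret — safe to commit, since only the
# sealed-secrets controller in the target cluster can decrypt it.
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: monitoring-api-key
  namespace: monitoring
spec:
  encryptedData:
    api-key: AgBy3i4OJSWK...   # ciphertext produced by kubeseal (truncated)
  template:
    metadata:
      name: monitoring-api-key
      namespace: monitoring
```

Such a resource is typically generated by piping a regular Secret manifest through the `kubeseal` CLI, which encrypts it against the target cluster's public key.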
Our monitoring stack consists mainly of Prometheus, Grafana, and Loki, all open-source and designed for cloud-native environments. Prometheus gathers metrics from our applications and infrastructure, Grafana visualizes those metrics and alerts on them, and Loki aggregates our logs.
We opted for Grafana Cloud over self-hosting Prometheus/Grafana to keep our monitoring distinct from our main infrastructure, ensuring resilience and reducing internal system dependencies. Each cluster has its own monitoring configuration, which is managed through Flux.
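As a sketch, assuming a Prometheus Operator-based setup (e.g. kube-prometheus-stack), shipping a cluster's metrics to Grafana Cloud boils down to a `remoteWrite` entry; the endpoint URL and secret names below are placeholders:

```yaml
# Illustrative remoteWrite fragment on a Prometheus Operator resource.
# URL and credential secret are assumptions, not our actual values.
remoteWrite:
  - url: https://prometheus-us-central1.grafana.net/api/prom/push
    basicAuth:
      username:
        name: grafana-cloud-credentials   # a Secret in the cluster
        key: username
      password:
        name: grafana-cloud-credentials
        key: password
```

The referenced credentials are exactly the kind of secret we commit to Git as a SealedSecret rather than in plain text.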
We utilize a central identity provider for managing users and permissions, enabling us to revoke access to all our systems with just one click. While we'd prefer to use it for Kubernetes authentication as well, Kubernetes' built-in OpenID Connect support lacks usability and flexibility. Additionally, configuring OIDC this way requires access to the Kubernetes API server flags, which isn't always available.
Though managed offerings like AKS, EKS, and GKE have their own provider-specific solutions, we sought a straightforward, universal option.
We chose Pinniped without federation, deploying only Pinniped's Concierge and setting up a JWTAuthenticator for each cluster.
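A JWTAuthenticator sketch, with a hypothetical issuer URL and a per-cluster audience (the claim names depend on the identity provider's token format):

```yaml
# Illustrative JWTAuthenticator for one cluster. The issuer, audience,
# and claim names are assumptions, not our actual configuration.
apiVersion: authentication.concierge.pinniped.dev/v1alpha1
kind: JWTAuthenticator
metadata:
  name: central-idp
spec:
  issuer: https://idp.example.com          # hypothetical identity provider
  audience: y-corp-staging                 # unique OIDC client per cluster
  claims:
    username: email                        # token claim used as username
    groups: groups                         # token claim used for groups
```

Because the `audience` differs per cluster, a token minted for the staging cluster is rejected by production, which is exactly the property discussed below.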
It’s worth noting that despite having one identity provider, we set up an OpenID Connect Relying Party for each cluster, as explained in the documentation:
> In general, it is not safe to use the same OIDC client across multiple clusters. Each cluster should use its own OIDC client to ensure that tokens sent to one cluster cannot also be used for another cluster.
Imagine a scenario where a user has access to a staging cluster, but not to production. If both clusters use the same OIDC client, a token for the staging cluster would also be valid for the production cluster.
In this article, we shared our playbook for multi-cluster management, built on a strong emphasis on automation and security, to guide fellow developers through the challenges that come with it. Managing multiple Kubernetes clusters can be tough, but with the right mindset and approach, it's a challenge we can all conquer.
Wondering how we can help you with your Kubernetes journey? Get in touch with us!