Hackathon: Could we run NU.nl on LeafCloud sustainable hosting?
Once or twice a year, NU.nl organizes a hackathon, allowing its IT staff to dabble in new technology and pursue ideas, usually guided by several themes. 2023 kicked off with a hackathon built around the themes ‘NU2030’, ‘NU.nl from scratch’ and ‘A greener NU.nl’.
Mid-2022 I visited EdgeCase, a 1-day conference about running Kubernetes at the edge. One of the presentations was by LeafCloud, a hosting provider focusing on reducing the environmental impact of cloud computing by re-using the heat it generates.
Back then, I somewhat jokingly proposed to run our /klimaat section on LeafCloud. Since most of our computing already runs on Kubernetes, and one of this hackathon’s themes was climate, there was no better moment to do a POC.
Disclaimer: This is a personal blog. So while NU.nl (part of DPG Media) facilitated the hackathon and supports the theme, opinions and conclusions are mine. This post holds no commitment in any way from NU.nl or DPG Media.
LeafCloud uses the heat generated by servers to substitute the use of fossil fuel for heating. It does so by constructing LeafSites at the location where heat is needed, such as swimming pools or apartment buildings. In short:
Don’t bring cooling to the data center. Bring the data center to the cooling.
It’s a creative and pragmatic approach to reducing the environmental impact, and one that resonates well with me (it might be my background in Industrial Design Engineering).
Now there are various other ways to reduce the environmental impact of computing:
- Right-sizing and autoscaling
- Optimizing programming languages and frameworks
- More efficient CPUs, such as ARM
There is no ‘or’ here. All of them are worth pursuing. However, as the image below shows, none of them might be as effective as using hosting designed from the ground up to be environmentally friendly.
The image below shows a simplified outline of the NU.nl architecture. The website and mobile apps consume a Backend For Frontend (BFF), which in turn uses various other (mostly REST) APIs. Everything is fronted by Akamai to keep the bad people away.
Data and private-network APIs have gravity, and moving away from them tends to be complicated. Compute that consumes public APIs is easy: it can run anywhere.
The scope of the POC is deploying the website (F1) and optionally the BFF for a non-prod environment to LeafCloud. Getting the workloads to run somewhere else is not expected to be the hard part. The goal is to explore what it would take to go beyond a POC, and to identify blocking topics or topics that need further investigation.
The abilities required to effectively run a set of Kubernetes applications outside of AWS can be categorized as follows:
- Easily set up clusters
- Quickly deploy a group of applications
- Integrate with AWS and other services
Easily set up clusters
As Kubernetes matures, various improvements have made it easier than ever to treat clusters as ephemeral resources. Clouds offer managed Kubernetes, and lightweight alternatives such as K3s and RKE2 have emerged. And then there is ClusterAPI: the ability to designate a cluster as a management cluster and deploy remote clusters in much the same way as deploying pods.
LeafCloud is based on OpenStack, which itself also offers managed Kubernetes clusters. Exploring some of the options to get started resulted in the following:
(Table comparing the technology options on autoscaler support and ease of getting started.)
RKE2 is quite easily set up using the remche/terraform-openstack-rke2 Terraform module, which also sets up network components. LeafCloud was kind enough to provide some example IaC based on this module that adds a properly configured storage driver and cloud-controller-manager (used by the K8s control plane to provision load balancers).
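As an illustration, a cluster definition based on this module could look roughly like the sketch below. The input and output names are assumptions and should be checked against the module’s documentation:

```hcl
# Hypothetical cluster definition; input names are illustrative and should
# be verified against the remche/terraform-openstack-rke2 module docs.
module "rke2" {
  source = "remche/rke2/openstack"

  cluster_name = "nu-poc"

  # OpenStack image and flavor for the nodes (names are placeholders)
  image_name  = "ubuntu-22.04"
  flavor_name = "ec1.medium"

  # Number of server (control plane) and agent (worker) nodes
  servers_count = 3
  agents_count  = 2
}

# The module is assumed to expose a kubeconfig that downstream
# providers and tooling can consume.
output "kubeconfig" {
  value     = module.rke2.kubeconfig
  sensitive = true
}
```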
So, since time was limited and we also wanted to explore other topics, we opted to hit the ground running and started with RKE2.
When exploring further, it would be worth trying out OpenStack’s managed Kubernetes as well as ClusterAPI. To see what’s possible using ClusterAPI, it’s worth reading this blog post by Helio.
Quickly deploy a group of applications
Once stamping out clusters is possible, making sure each runs the required set of applications becomes the next challenge. Often, pipelines combine the build (CI) and deploy (CD) parts. However, once the number of applications and possible deploy targets grows, maintainability becomes a problem:
necessary pipeline changes = number of applications * number of targets.
So, instead of saying, “deploy this to cluster xyz” for each application, we need to just publish application artifacts and configure clusters xyz to “run this collection of applications.” We need building blocks that we can easily compose.
Composition can be accomplished using:
- Terraform modules and submodules
- ArgoCD applications, via the App of Apps pattern
Both can refer to manifests, Kustomize overlays or Helm charts. Terraform is push-based and integrates more easily with other IaC, using outputs of other modules to set variables to K8S deployments. ArgoCD is pull-based and has the advantage of not requiring external access to the K8S control plane.
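As a sketch of the Terraform flavor of this composition (all names hypothetical), a single ‘bundle’ module can enumerate the applications, so each cluster only needs to instantiate the bundle:

```hcl
# modules/bundle/main.tf — a hypothetical bundle of applications.
# The helm provider is assumed to be configured in the root module
# against the target cluster's kubeconfig.

variable "apps" {
  type    = set(string)
  default = ["website", "bff"] # placeholder application names
}

# Each application is one building block; adding an entry here rolls
# it out to every cluster that instantiates this bundle.
resource "helm_release" "app" {
  for_each = var.apps

  name       = each.key
  repository = "https://charts.example.org" # hypothetical chart repository
  chart      = each.key
}
```

A cluster then pulls in the whole set with a single `module` block, keeping pipeline changes proportional to the number of applications rather than applications × targets.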
For this POC, we used Terraform, since other IaC was already based on Terraform. Terraform solves the maintainability problem by composing modules and submodules. At scale, it would still require `terraform apply` to run across an increasing number of clusters, updating an increasing number of apps, which might become problematic. For now this is a theoretical problem, but at some point ArgoCD would likely be more effective.
Integrate with AWS and other services
Our center of operations is AWS, combined with various SaaS solutions (observability, security). Most of the concepts here could apply to other clouds as well.
There are two directions here:
- Using IAM from AWS to grant access to the K8S control plane
- Using IAM from workloads within the cluster to interact with AWS services
The former can be accomplished with AWS IAM Authenticator. CI/CD processes within AWS can then use IAM roles to access the cluster. Lock away the original client-certificate-based kubeconfig and use short-lived IAM-based tokens from there on.
Using IAM from the cluster is more complex. There are roughly two ways to do this:1
- Use IAM access keys or session tokens
- IAM Roles Anywhere
IAM Roles Anywhere is not for the faint of heart and, unless planned and executed perfectly, can be a surefire way to shoot oneself in the foot. Zscaler has an interesting blog post about this.
IAM credentials it is, then. There are some ways to improve the security posture of this approach (also addressed in the Zscaler blog post):
- Automate the cycling of credentials
- Alerting on key authentication failures (condition mismatches or attempts at actions outside of granted privileges)
- Use credentials unique to a specific remote cluster and application
- Apply least-privilege principles
- Use conditions to restrict credential usage to the specific remote cluster
One can put either access keys or session credentials in the remote cluster. Audit logs would then show `AKIA...` (access keys) or `ASIA...` (session credentials), resulting in either screams of terror or sighs of relief2. Using session credentials inherently forces rotation, which is good; the downside is that failing to rotate immediately results in an outage.
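The ‘automate the cycling of credentials’ item can be sketched with Terraform’s hashicorp/time provider, which forces a fresh key pair on a schedule. The user name is hypothetical:

```hcl
# Rotate the access key every 30 days by tying its lifecycle to a
# time_rotating resource (hashicorp/time provider).
resource "time_rotating" "key_rotation" {
  rotation_days = 30
}

resource "aws_iam_user" "bff" {
  name = "leafcloud-nu-poc-bff" # hypothetical: unique per cluster and app
}

resource "aws_iam_access_key" "bff" {
  user = aws_iam_user.bff.name

  lifecycle {
    # Force a new key pair whenever the rotation timestamp changes
    replace_triggered_by = [time_rotating.key_rotation.id]
  }
}
```

The new key still has to reach the remote cluster as a secret; that delivery step is what makes ‘do later’ tempting and is exactly what should be automated along with this.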
Automation and alerting are intentionally listed first, since they tend to fall into the ‘do later’ category. Done right, this could address the concerns outlined in the Zscaler blog post:
**Access keys can be forgotten and leaked**
- Short max age
- Condition preventing use outside of the cluster

**There is no visibility into who is the entity using these keys**
- Key having a well-defined cluster and application scope
- Condition preventing use outside of the cluster
- Alerting on key authentication failures

**They require regular rotation**
- Automated cycling
- Short max age
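A least-privilege policy with a cluster-scoping condition could look roughly like this; the bucket, user name and egress address are placeholders:

```hcl
# Hypothetical least-privilege policy for one app on the remote cluster.
data "aws_iam_policy_document" "bff_remote" {
  statement {
    sid       = "ReadAssets"
    actions   = ["s3:GetObject"]                     # only what the app needs
    resources = ["arn:aws:s3:::nu-example-assets/*"] # placeholder bucket

    # Reject use of these credentials from anywhere but the cluster's
    # egress address, limiting the blast radius of a leaked key.
    condition {
      test     = "IpAddress"
      variable = "aws:SourceIp"
      values   = ["203.0.113.10/32"] # placeholder: the cluster's NAT/egress IP
    }
  }
}

resource "aws_iam_user_policy" "bff_remote" {
  name   = "bff-remote-access"
  user   = "leafcloud-nu-poc-bff" # hypothetical per-cluster, per-app user
  policy = data.aws_iam_policy_document.bff_remote.json
}
```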
It is clear that the above involves more moving parts than only using IAM roles within the AWS environment. That is the trade-off of using more than one cloud vendor.
Important: Security is hard. Validate anything security related with appropriate teams or peers that can challenge these concepts. Also be sure to point out any flaws in the above via the social links on this blog.
For starters, network presence should never be considered an authorization mechanism. This is part of the Zero Trust approach that is becoming prevalent in the industry.
That said, private networks can add a layer of security and are common. Integrating an AWS VPC with a private network is not trivial and could offset most of the potential cost savings.
There are Zero Trust products that can link networks and devices by means of an identity-aware overlay network. Often these are based on WireGuard, creating a mesh VPN. The no-bullshit ZTNA vendor directory is a great starting point when exploring this topic.
For our POC, we limited ourselves to workloads that consume public APIs only, avoiding the need to integrate networks.
Integrating with AWS can be accomplished by leveraging IAM, as described previously.
Observability and security components can be installed on remote clusters in a similar way to existing clusters. Information flows to the SaaS platforms already in use, or to centralized setups.
Putting it together
Putting all of the above together results in the following setup3:
Terraform has proven valuable in combining various cloud platforms in a single IaC setup. We can use AWS for Terraform state and storage of OpenStack credentials, use OpenStack provider for cluster setup, and use Kubernetes/Helm providers for deploying cluster resources.
Even when moving beyond a POC and possibly integrating ClusterAPI or ArgoCD, Terraform could still be the linchpin because of its powerful chaining of output values and variables. Custom resources like a ClusterAPI Cluster or an ArgoCD Application could be deployed by Terraform, with values that originate from other cloud resources.
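A rough sketch of how the providers could be wired together in a root module; all names, as well as the rke2 module outputs, are assumptions:

```hcl
# Root module sketch: AWS holds state and secrets, OpenStack hosts the
# cluster, Kubernetes deploys into it. All names are placeholders.
terraform {
  backend "s3" {
    bucket = "nu-example-terraform-state" # hypothetical state bucket
    key    = "leafcloud-poc.tfstate"
    region = "eu-west-1"
  }
}

# OpenStack credentials are read from AWS, not stored in code
data "aws_ssm_parameter" "os_password" {
  name = "/leafcloud/openstack-password" # hypothetical parameter name
}

provider "openstack" {
  auth_url  = var.openstack_auth_url
  user_name = var.openstack_user
  password  = data.aws_ssm_parameter.os_password.value
}

# Chain assumed module outputs into the Kubernetes provider
provider "kubernetes" {
  host                   = module.rke2.kube_host
  cluster_ca_certificate = module.rke2.cluster_ca
  client_certificate     = module.rke2.client_cert
  client_key             = module.rke2.client_key
}
```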
Why would we do this?
Well, there is ‘lead by example’: If we want to reduce our environmental impact, we should consider all options to do so. That said, putting our workloads on a different cloud might not be the lowest-hanging fruit.
The potential emission savings are shown below, as well as a pricing comparison of AWS and LeafCloud resources. Make no mistake: setting up shop elsewhere takes effort, and effort brings cost. But the fact that cloud resource costs could be tens of percent lower at least shows potential.
Based on 20 kg/core/month (source: LeafCloud). Calculations via the EPA Greenhouse Gas Equivalencies Calculator.
As expected, deploying the workload itself in a different cloud was easy. Integrating various AWS services via IAM is possible but requires upfront planning and careful evaluation and execution. Integrating private networks involves procurement and integration of a networking solution. Networking would be the biggest hurdle moving forward.
Exploring further, topics to address include:
- Adapting the POC to a tech stack that supports cluster-autoscaler
- Evaluate cluster setup via Terraform vs. ClusterAPI
- Evaluate bundles of apps via Terraform vs. ArgoCD
- Evaluate resiliency and fail-over scenarios when using ‘single-AZ’ clouds4
- Prepare for day-two operations: cluster upgrades and node patching (swapping out clusters for version n+1 replacements could be an option)
Multi-Cloud does not need to go as far as being cloud-agnostic, resulting in the lowest common denominator of each cloud and lots of abstractions. Mixing different cloud services based on available features or cost is possible. As always, it comes with trade-offs.
LeafCloud’s underlying platform, OpenStack, is well documented and widely supported, which allows for a smooth onboarding process and, from a strategic perspective, avoids lock-in to a single new cloud. Tools and services already in use can, in most cases, be easily integrated.
I would encourage anyone remotely into multi-cloud and sustainability to take a look at LeafCloud. Making an impact requires taking steps. Hopefully, this blog post shows that making the biggest impact might not even require the biggest of steps.
Thanks go to LeafCloud for great assistance during this hackathon!
Besides probably a number of vendors wanting to solve the multi-cloud identity problem. Not in scope for a POC. ↩︎
Admitted: Doing somewhat uncommon things with IAM is unlikely to cause any sighs of relief at all. ↩︎
Observability and Security tools for illustrative purposes. We’re not using all of them. ↩︎
Are you 100% sure your multi-AZ setup can handle AZ-failures? Do you test it? ↩︎