In a recent (German) blog post, Christian described what we want to achieve by moving into the cloud, and touched on how we are going to get there. One point he mentioned was that we define everything in the cloud as infrastructure as code. While our workloads are primarily deployed to Azure Kubernetes Service (AKS) using GitOps with ArgoCD, we have decided to deploy the infrastructure using Terraform.
Why Go Into the Cloud?
Our goal is to overhaul Interhyp’s IT infrastructure using cloud technology. This transformation is a must if we are not to lag behind our competitors in cost efficiency, on-demand scalability, and product development velocity (by facilitating rapid prototyping, reducing time-to-market, and giving us access to pluggable state-of-the-art technology). It is also key to achieving our ambitious growth goal of a 20 % market share by 2025.
Going Into the Cloud Is Not Easy
You can start using the cloud with a sequence of clicks – if you are a small company with few regulatory requirements. Interhyp is neither. We face two major issues: collaboration and auditability.
The First Problem: Each and Every Infrastructure Change Must Be Documented… Forever
As part of the publicly traded ING Bank, we must comply with a wealth of regulatory requirements intended to maximize the integrity and security of our systems and data. One of them is extensive documentation of code and system changes: it must be clear, at any given time, who changed which part of the code or system, when it was changed, what was changed, and for what reason.
The Second Problem: Collaborative Cloud Infrastructure Development
As we expect well over 100 engineers to collaborate on the cloud infrastructure, often many of them simultaneously, we have to solve a major issue: how can our engineers work on the same infrastructure at the same time without creating chaos or breaking the system?
Infrastructure as Code
The solution is called “Infrastructure as Code”: Each cloud resource is declared as machine-readable code. See this little example:
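A minimal, illustrative snippet (the names here are placeholders, not our actual code) declaring an Azure resource group and a storage account:

main.tf
provider "azurerm" {
  features {}
}

# A resource group to hold the storage account
resource "azurerm_resource_group" "demo" {
  name     = "rg-iac-demo"
  location = "westeurope"
}

# A storage account, wired to the resource group above
resource "azurerm_storage_account" "demo" {
  name                     = "stiacdemo" # must be globally unique in reality
  resource_group_name      = azurerm_resource_group.demo.name
  location                 = azurerm_resource_group.demo.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}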
With this in place, every infrastructure change is recorded in the version control system (e.g. GitHub), providing a perpetual audit trail and enabling peer reviews and sophisticated approval controls. However, this raises a new question: how do we keep all that code manageable?
Once you start using tools like Terraform, you quickly realize that your code becomes a mess, a huge mess, and you need to manage it somehow. There is just too much repetition, not enough control, and no clear structure or pattern to follow. That is why we tossed Terragrunt into the mix and used it to clean up and structure our infrastructure code. We will cover our Terragrunt and Terraform setup in more depth in another blog post.
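To give a flavor of that cleanup, here is a simplified sketch (not our production setup; the state storage names are placeholders). With Terragrunt, the remote state backend is declared once in a root terragrunt.hcl and inherited by every module instead of being repeated in each of them:

terragrunt.hcl (repository root)
remote_state {
  backend = "azurerm"

  # Generate a backend.tf in every module that includes this file,
  # so the backend is defined exactly once
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }

  config = {
    resource_group_name  = "rg-terraform-state" # placeholder
    storage_account_name = "stterraformstate"   # placeholder
    container_name       = "tfstate"
    # Every module gets its own state file, derived from its directory
    key = "${path_relative_to_include()}/terraform.tfstate"
  }
}

Each module then only needs the three-line include block you will see in the generated code below.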
From this, a new problem arises: now the learning curve is so steep you could climb Mount Everest with it. Not only do you need to understand the intricacies of the Azure resources, you also need to wrap your head around the details of Terragrunt and Terraform. We had to do something to make it usable for the actual users: the engineers and developers.
How Can We Make the Cloud Usable?
We needed to make it easy to deploy the resources required for running applications, without compromising on security and compliance. In addition, it must also be possible to expand and customize the resources if needed.
To solve this, we created a simple scaffolding process that generates all the necessary files based on custom configurations. Here is what our prototype looks like for scaffolding and deploying a CosmosDB.
The Interhyp Cloud Scaffolding Process
Step 1: Configure
We start by generating a config for the blueprint generator. The config contains all the required properties for generating the CosmosDB Terragrunt code.
> configure.sh cosmosdb
generated-cosmosdb.yaml
generator: cosmosdb
parameters:
  databaseName: "demo"
  longName: "dem"
  shortName: "dm"
  confidentiality: "C2"
  integrity: "I3"
  availability: "A2"
Step 2: Scaffold
We fill in the required properties and generate the Terragrunt code using the same generator CLI. The generator reads the config and uses it to construct the Terragrunt files.
> scaffold.sh cosmosdb
cosmosdb/terragrunt.hcl
include {
  path = find_in_parent_folders()
}

locals {}

# Pin the versioned CosmosDB module owned by the database team
terraform {
  source = "git::ssh://git@myhost.com/interhyp/az-cosmosdb.git//?ref=v1.2.0"
}

# Reference the centrally managed CosmosDB account; the mock outputs keep
# init/validate/plan working before the dependency has been applied
dependency "cosmosdb_account" {
  config_path = "${dirname(find_in_parent_folders("product.hcl"))}/platform/database/cosmosdb/C2I3A2/mongo/account"

  mock_outputs_allowed_terraform_commands = ["init", "validate", "plan"]
  mock_outputs = {
    name                = "cosmosdb-account-mock"
    resource_group_name = "cosmosdb-account-rg-mock"
  }
}

inputs = {
  resource_group_name   = dependency.cosmosdb_account.outputs.resource_group_name
  cosmosdb_account_name = dependency.cosmosdb_account.outputs.name
  database_name         = "demo"

  collections = {
    "docs" = {
      shard_key = "_id"
    }
  }
}
cosmosdb/resource.hcl
locals {
  resource_short = "dm"
  resource_long  = "dem"
  tags           = {}
}
You may have noticed that the generator only created the code for the database and its collections within an existing CosmosDB account. This is intentional: our database team manages the CosmosDB accounts, and that same team develops and owns the database Terraform modules. The team ensures that all compliance and security settings are applied by default and cannot be changed.
Step 3: Deploy
All that is left to do is check everything into version control and deploy the freshly generated CosmosDB. This can be done with the usual Terragrunt commands, or with our shortcut deploy.sh script.
> deploy.sh cosmosdb
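deploy.sh is just our shortcut; for illustration, the standard Terragrunt workflow run from the generated directory achieves the same result (assuming credentials and the state backend are already configured):

> cd cosmosdb
> terragrunt plan
> terragrunt apply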
The best part: the freshly deployed CosmosDB is secure and compliant by design (illustrated in the sketch after this list), meaning:
- Network access is only possible through our internal network – no public endpoint is exposed
- All data is encrypted with our own keys, which are stored in an HSM and managed by Azure Key Vault
- Access to the database is controlled by a distinct set of roles and group assignments
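To illustrate how such defaults can be enforced, here is a hypothetical excerpt (not our actual az-cosmosdb account module): the module hard-codes the compliance-relevant settings instead of exposing them as variables, so callers cannot override them.

main.tf (hypothetical excerpt)
variable "name" { type = string }
variable "location" { type = string }
variable "resource_group_name" { type = string }
variable "key_vault_key_id" { type = string }

resource "azurerm_cosmosdb_account" "this" {
  name                = var.name
  location            = var.location
  resource_group_name = var.resource_group_name
  offer_type          = "Standard"
  kind                = "MongoDB"

  # Compliance settings are fixed here, not exposed as variables
  public_network_access_enabled     = false # no public endpoint
  is_virtual_network_filter_enabled = true  # reachable only from our network
  key_vault_key_id                  = var.key_vault_key_id # customer-managed key

  consistency_policy {
    consistency_level = "Session"
  }

  geo_location {
    location          = var.location
    failover_priority = 0
  }
}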
Deploying to Production
The process above only covers deployment to the dev stage. Developers and engineers have limited access to the subscription and can only deploy curated resources, like a CosmosDB, an Azure Function, and so on. Infrastructure that goes to production must pass through the following process.
- Copy the generated Terragrunt code into the relevant subscription directory, e.g. prod
- Check all of the files into source control
- Create a pull request against the mainline
- Wait for our approval service (azure-valve) and the Atlantis plan output
- Get approval from at least one person not involved in the pull request
- Once all checks pass, the pull request can be applied with Atlantis and is automatically merged into the mainline (a sketch of such a setup follows below)
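For context: Atlantis is driven by pull request comments (atlantis plan, atlantis apply). A repo-level configuration wiring a Terragrunt project into Atlantis could look roughly like this sketch (the project name and path are assumptions, not our actual setup):

atlantis.yaml
version: 3
projects:
  - name: cosmosdb-prod
    dir: prod/cosmosdb
    workflow: terragrunt
workflows:
  terragrunt:
    plan:
      steps:
        - run: terragrunt plan -out=$PLANFILE
    apply:
      steps:
        - run: terragrunt apply $PLANFILE

Note that defining custom workflows in the repo-level file only works if the Atlantis server is configured to allow it.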
Let us know in the comments if you want us to go deeper and explain the actual scripts and the scaffolding process happening behind the scenes. Or maybe you are more interested in the details of the deployment process with Atlantis and the role our azure-valve service plays?