Tech Talk: "Inner Sourced Infrastructure" with Karim Darwish from FireStart GmbH

Hi everyone, my name is Karim and I'm a Junior Software Architect at FireStart. Today I want to share the story of how we moved to automated infrastructure using an inner-sourced approach – all the challenges we faced, all the ups and downs.

Just to clarify something first: when I say infrastructure, I mean all the cloud infrastructure and cloud services that our SaaS solution needs in order to run and provide its services. In our case that was a managed Kubernetes cluster, an Azure Key Vault to store secrets, an Azure Storage Account for all the files and data, and some other managed database resources.

To start, I want to go back in time to September 2020. At that point we had been running our whole infrastructure in our cluster for about a year; the catch, however, was that everything was configured manually. We went to the Azure Portal, created our cluster by hand using the wizard that Azure provides, and did all the configuration manually. The same was true for our Kubernetes cluster: we applied all the resources and configuration files by hand using the command line. This wasn't really a problem for us because things were working great and we didn't have any issues – until we did.

One deployment seemed to override a manual configuration we had applied, causing our services to crash and our product to become unavailable. To fix this, we had to recreate parts of the cluster – obviously not a good solution, and we knew we needed to change something. Before coming to that, I want to start with the main pain points we had at the time. First, the huge amount of manual configuration required for our infrastructure to work: because everything was done by hand, we had no reproducibility. By that I mean we couldn't recreate our whole cluster in an automated fashion; there was no easy way to spin up a new cluster for a new stage, we would have had to do it manually. Second – since we didn't really know what was deployed and what was configured in Azure – there was the possibility of something changing without us noticing. We could have a drift between what was actually configured and what we wanted to be configured.

While looking for a solution, we came across infrastructure as code. It seemed like the perfect fit for our use case, because the whole idea is that all the infrastructure and everything you need to provide your service is defined in code – which would let us create our infrastructure in an automated fashion. As our tool we chose Terraform, probably the best-known infrastructure-as-code tool. A resource is how you define everything you want Terraform to create; in our case, an Azure Kubernetes cluster with all the configuration we want it created with. Instead of going to Azure, clicking through the wizard, and doing all the configuration manually, we just define everything in code, put it under version control, and Terraform uses the Azure API to create those resources. What's really great about having everything defined in code is that you can also apply common development practices. For example, if you have four resources that you know belong together and are related to each other, you can extract them into a module; if you know two resources are coupled, you can apply refactorings to decouple them. It's also much easier for developers without real DevOps experience to get started.
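As an illustration, a minimal Terraform resource for a managed Kubernetes cluster might look like the following. This is a hedged sketch: the names, region, node sizes, and resource group are placeholder values for this example, not FireStart's actual configuration.

```hcl
# Sketch of an AKS cluster defined as code (illustrative values only).
resource "azurerm_kubernetes_cluster" "main" {
  name                = "example-aks"
  location            = "westeurope"
  resource_group_name = "example-rg"
  dns_prefix          = "example-aks"

  default_node_pool {
    name       = "default"
    node_count = 3
    vm_size    = "Standard_D2s_v3"
  }

  identity {
    type = "SystemAssigned"
  }
}
```

Running `terraform apply` against a definition like this calls the Azure API and creates (or updates) the cluster so it matches the declared configuration.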

Having configured our whole infrastructure as code using Terraform, we were able to create new clusters by executing a single command and waiting for Terraform to make all the required API requests. This really revolutionized how we build our infrastructure: we could execute that one command, make sure everything was running afterwards, and there we had our new stage. That was our dream, basically!

This was working really well. Then we broke it. We had our whole Terraform project set up and it was working well, but we applied everything locally: we ran the Terraform commands on our local machines, and local configuration could interfere with Terraform's execution. What happened was that while Terraform was updating our cluster, we changed the local Kubernetes context to point to a different cluster. So mid-run we effectively told Terraform to update another cluster, breaking that second cluster in the process by applying the wrong configuration to it. Once again we had to revert all those changes and restore the services – not a great time.

To prevent this from happening again, we moved everything to automated pipelines. Nothing is done on local machines anymore, and we introduced a new process for how we do infrastructure: a GitOps approach built around an inner-sourced repository. Instead of developers changing the Terraform configuration locally, they now change it in Git. They check out the repository, make the change, push it, and create a pull request, and an automated pipeline validates the change made in the pull request. Terraform also gives us a diff view, so the pull request's author and reviewers can easily see the impact the change has on other resources. Once the pull request is merged, an automated pipeline applies those changes to the real resources. So if you want to update our Kubernetes cluster, you change it in the code and let the automated pipeline do its thing and apply that change to the infrastructure.
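The validate-on-pull-request, apply-on-merge flow described above can be sketched as a pipeline. This is a simplified, hypothetical example in Azure Pipelines syntax – the structure and step names are assumptions for illustration, not our actual pipeline definition.

```yaml
# Sketch: validate Terraform on every build, apply only after merge to main.
trigger:
  branches:
    include: [main]

pool:
  vmImage: ubuntu-latest

steps:
  - script: terraform init
    displayName: Initialize Terraform

  - script: terraform fmt -check && terraform validate
    displayName: Lint and validate

  - script: terraform plan -out=tfplan
    displayName: Plan (the diff view reviewers look at)

  # Apply only runs for builds of main, i.e. after the pull request merged.
  - script: terraform apply -auto-approve tfplan
    displayName: Apply
    condition: eq(variables['Build.SourceBranch'], 'refs/heads/main')
```

The key design point is that `terraform plan` runs for every change and produces the reviewable diff, while `terraform apply` is gated behind the merge.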

The inner-sourced approach is the one we chose. Basically, "inner source" means running an internal, open-source-style process around the repository. Our idea was that every developer on any team should first of all be able to see which infrastructure our service needs, and then also be able to easily change that infrastructure – to have an easy time fixing infrastructure bugs and, of course, to add the resources they need to build features.

This also finally allowed us to introduce governance around the whole infrastructure topic – something we did not have before. Since everything was now done through the Terraform pipeline, we could take access away from pretty much everybody in the organization, especially write access, because nobody needed to make changes to our cluster manually anymore. They could just make the change in code and have the Terraform pipeline apply it. This really helped us follow the principle of least-privileged access and make sure that as few people as possible actually have write access to our infrastructure. It also allowed – forced! – us and our developers to make every change through Git, giving us a transparent way of tracking who changed what and when, and of reverting changes if needed.
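In Azure, this kind of least-privilege split can itself be expressed in Terraform – for example, giving a developers group read-only access while only the pipeline's service principal can write. The following is a hedged sketch: the subscription ID, resource group, and variable names are placeholders I've invented for illustration.

```hcl
variable "developers_group_object_id" {
  type = string
}

variable "pipeline_service_principal_object_id" {
  type = string
}

# Developers can inspect the resource group but not modify it.
resource "azurerm_role_assignment" "developers_read_only" {
  scope                = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/example-rg"
  role_definition_name = "Reader"
  principal_id         = var.developers_group_object_id
}

# The CI/CD service principal is the only identity with write access.
resource "azurerm_role_assignment" "pipeline_contributor" {
  scope                = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/example-rg"
  role_definition_name = "Contributor"
  principal_id         = var.pipeline_service_principal_object_id
}
```

Defining the role assignments in the same repository keeps the governance rules themselves under review via pull requests.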

This is where we're at right now. We're able to create clusters in a really short amount of time using these automated pipelines, and we have governance set up so that everything has to go through Terraform and this GitOps approach. Something we're also looking into is introducing DevSecOps into the whole process – a security scanning tool that checks your pull request for any security vulnerability the change might introduce and acts as a quality gate, to make sure we're well set up on the security front.
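One way such a quality gate could look is an extra step in the pull-request pipeline that fails the build on findings. The tool choice here (tfsec, a static analyzer for Terraform code) is my assumption for illustration – the talk doesn't name a specific scanner.

```yaml
  # Hypothetical quality gate: statically scan the Terraform code for
  # security issues and fail the pull-request build if any are found.
  - script: tfsec . --minimum-severity HIGH
    displayName: Security scan (quality gate)
```

Because the scan runs before merge, insecure configuration never reaches the apply stage.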

With that: if you also wanna break some clusters with us – and create new ones in a short amount of time – make sure to check out our career page! I've also added my LinkedIn and GitHub; if you have any questions or feedback, or just wanna chat, you can always hit me up there. And with that: thank you!