Three steps for choosing the right infra as code tool in AWS

You are running workloads in the cloud and want to step up your game by using infrastructure as code. Awesome! Which tool are you going to choose? And why? It is hard to find nuanced guidance online: few developers have in-depth experience with the technology, and even fewer have tried many different tools in a production setting. As AWS consultants, my colleagues and I have tested all sorts of tools in real-life work situations. We got a taste of what really held us back, what was awesome, and what seemed like a big deal but turned out not to be. So, in this blog, I will try to help you pick the right tech by showing you what mattered the most in my experience.

By following three steps, I hope you will have a better understanding of which tool is the right one for you.

Step 1: Should I (not) use CloudFormation?

Are you considering CloudFormation, or a derivative (including CDK, SAM, Serverless Framework and more)? In my experience, I would advise against CloudFormation-based tools if you meet one of the following conditions:

  1. You generate drift from infrastructure templates

    Typically there are two major contributors to drift:

    • Application deployments cause infrastructural drift

      Some application deployment processes update AWS infrastructure. For example, the ECS update-service API can be used for both infrastructure configuration and application deployment. CloudFormation cannot easily deal with this split responsibility as updating the ECS Service in CloudFormation will trigger a complete deployment of an (old) container image.

    • Operational processes are managed outside of the infra as code
      We sometimes need to update our RDS database versions, make manual fixes, or do database restores and don't want to do this using infrastructure as code. Operations are automated, just not using the infrastructure as code tool. If this is the case for you, CloudFormation is pretty bad at handling this type of drift, and re-importing these changes is a nightmare.

  2. You are in a non-cloud-native ecosystem (open-source and any third parties)
    If you want to configure Cloudflare, Kubernetes or other clouds using the same toolset, CloudFormation is not your friend. Yes, CDK has custom resources for Kubernetes, but NO, I would not recommend them, as they make things unnecessarily complex.

You might wonder why I only exclude CloudFormation here: CloudFormation is the only tool with limitations strong enough for me to rule it out from the start. Other tools also have their downsides, but these are not severe enough to rule them out.
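To make the drift from the first condition a bit more concrete, here is a minimal sketch in plain Python. It is a conceptual illustration only, not the API of CloudFormation or any other tool; the service attributes, image tags and the `find_drift` helper are all made up for this example.

```python
# Conceptual sketch of configuration drift; not the API of any real tool.

def find_drift(declared: dict, actual: dict) -> dict:
    """Return {key: (declared_value, actual_value)} for every key that differs."""
    keys = declared.keys() | actual.keys()
    return {
        key: (declared.get(key), actual.get(key))
        for key in sorted(keys)
        if declared.get(key) != actual.get(key)
    }


# What the infrastructure template says about a (hypothetical) ECS service.
declared_service = {"image": "my-app:1.0", "desired_count": 2, "cpu": 256}

# What is actually running after the deployment pipeline shipped newer images
# and an operator scaled the service by hand.
actual_service = {"image": "my-app:1.2", "desired_count": 4, "cpu": 256}

drift = find_drift(declared_service, actual_service)
for key, (declared, actual) in drift.items():
    print(f"{key}: template says {declared!r}, reality is {actual!r}")
```

If a tool simply re-applies the declared values here, the service goes back to the old image and the old task count. That is the kind of surprise I mean when application deployments and operations happen outside of the infrastructure templates.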

Step 2: Should I (not) use a programming language?

Just because a tool uses a programming language doesn't make it better. To illustrate: imagine that CDK only created plain CloudFormation resources (L1 constructs in CDK terms) and you had to write all abstractions yourself! Would it really be better than regular CloudFormation? You would have the benefit of type safety and the ability to write a for-loop. But in return, you would have to worry about tsconfig settings, more complex dependency management and more complex version compatibility! You might be better off with Terraform's HCL language in this case.
Overall, these 'advantages' should only be considered a MINOR benefit in most situations. Instead, first consider the following sections, THEN check whether a tool with a programming language still makes sense.
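To show how small the "I can write a for-loop" benefit is on its own, here is a rough sketch that generates a plain CloudFormation-style template from a loop using only the Python standard library. The environment names, logical IDs and bucket names are hypothetical; this is not how CDK works internally, just an illustration of a loop producing low-level resources.

```python
import json

# Hypothetical environment names; the bucket naming scheme is also made up.
ENVIRONMENTS = ["dev", "staging", "prod"]


def build_template(environments: list[str]) -> dict:
    """Build a CloudFormation-style template with one S3 bucket per environment."""
    resources = {}
    for env in environments:
        logical_id = f"{env.capitalize()}ArtifactBucket"
        resources[logical_id] = {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketName": f"example-artifacts-{env}"},
        }
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


template = build_template(ENVIRONMENTS)
print(json.dumps(template, indent=2))
```

The loop saves some copy-pasting, but nothing more: every property still has to be spelled out by hand. HCL covers the same ground with `for_each` and `count`, which is why I only count this as a minor benefit.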

Step 3: Consider the things that matter

So what should you care about? The tool should provide sufficient benefits to offset an increase in complexity. My main criterion: it should give me FAST and RELIABLE development of infrastructure, now and in the future. Security is for me a prerequisite that must be met before a tool is even considered. Below are a few factors that you can use to benchmark your tool with these requirements in mind.

  1. Give your tool a score for each factor listed in each category (1-10)

  2. I've provided my own opinion on how important a factor is. Multiply AWESOME benefits by 3, while multiplying GREAT benefits by 1.5, MEDIUM by 1 and MINOR benefits by 0.5. Feel free to adapt these factors according to your specific situation.

  3. Add up the points to see how the tool scores in each category.

  4. Compare the results for different tools.

A template scorecard is included below the list of factors as well.
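The arithmetic behind these four steps could be sketched as follows. The multipliers come from step 2 and the importance labels are the ones I give the maintainability factors further down; the two tools and their scores are placeholders, not a real assessment.

```python
# Multipliers from step 2 of the scoring procedure.
MULTIPLIER = {"AWESOME": 3.0, "GREAT": 1.5, "MEDIUM": 1.0, "MINOR": 0.5}

# Importance per factor for one category (maintainability).
MAINTAINABILITY = {
    "Evolvable codebase": "AWESOME",
    "Modularity": "GREAT",
    "Little housekeeping": "GREAT",
}


def category_score(importance: dict, scores: dict) -> float:
    """Sum of (multiplier x tool score) over all factors in one category."""
    return sum(MULTIPLIER[importance[factor]] * scores[factor] for factor in importance)


# Hypothetical scores (1-10) for two placeholder tools.
tool_a = {"Evolvable codebase": 8, "Modularity": 6, "Little housekeeping": 4}
tool_b = {"Evolvable codebase": 5, "Modularity": 8, "Little housekeeping": 7}

for name, scores in [("tool A", tool_a), ("tool B", tool_b)]:
    print(f"{name}: {category_score(MAINTAINABILITY, scores)}")
```

With these made-up numbers, tool A ends up at 39.0 and tool B at 37.5, even though tool B scores higher on two of the three factors. That is the effect of the weights: a strong score on an AWESOME factor outweighs good scores on GREAT ones.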

Development speed

  1. Managed Abstraction: AWESOME benefit
    The CDK team is focussing a lot on providing an abstraction for a wide set of use cases that are available out of the box! It makes AWS IAM and network connectivity a lot easier!
    But did you know that SAM and the Serverless Framework also provide abstractions purpose-built for your serverless projects? They do not include wild abstractions like the ApplicationLoadBalancedFargateService class of CDK, but for serverless applications they are often exactly what you are looking for!
    Pulumi Crosswalk also has some nice abstractions, but they aren't as extensive as CDK's.

  2. Fail & debug fast: AWESOME benefit

    All tools have a way to make development easier and detect mistakes early. terraform validate / plan, cdk watch, pulumi preview, and various IDE plugins can all help you spot mistakes early. Clear errors and logging greatly help with debugging as well. Some tools provide fast feedback when developing serverless apps, like sam sync, cdk watch and cdk deploy --hotswap, which deploy Lambda updates right after you write them. SST is for me the absolute king here, with its sst dev command giving you a complete local portal to debug and test your application stack.

  3. Application deployment: AWESOME benefit

    CDK is good at building Lambda functions and containers and deploying them right away! Other frameworks like the Serverless Framework and SAM are also pretty good at this. It is one of my favorite infrastructure as code features because it makes it so easy to do an end-to-end deployment of an application from scratch.

  4. Shallow learning curve: AWESOME benefit

    AWS SAM and the Serverless Framework are SUPER EASY to get started with. It's YAML with plugins. Other tools require more thought. CDK especially requires you to learn CDK best practices, CloudFormation for debugging, and programming language best practices. This is a lot to take in from the start. The same applies to Terraform wrapper tools like Terragrunt: by introducing them, you have to learn more tools to get started. Choosing tools with a shallow learning curve also lets you pick different tools for different use cases.

  5. Language flexibility: MEDIUM benefit

    All tools have features or plugins built in to deal with common problems like loops or string or array manipulation. Although sometimes annoying, it is rare to find limitations in language capabilities that have a massive velocity impact.

  6. Ecosystem integration: AWESOME benefit
    Each tool is part of an ecosystem. For example, there are CDK pattern libraries for quick-starting your project. But also think about static code analysis (e.g. Checkov) or available CI/CD integrations (GitHub Actions, CircleCI). Check out the awesome-cdk and awesome-terraform repositories to get an initial impression of their impressive ecosystems.

    It saves you a LOT of time if you don't have to integrate tools yourself.

  7. Purpose-built benefits and limitations: AWESOME benefits (hopefully)
    Depending on your use case, pick a purpose-built tool. These are benefits that ONLY that tool provides for a narrow set of use cases. For example, Serverless Stack (SST) has put a LOT of effort into improving the full-stack developer experience. Local Lambda debugging, even more abstraction and front-end + back-end type-safety features make this an awesome tool for serverless full-stack developers.
    It also works the other way around: if the purpose-built tool restricts you for your use case, give it a low score. For example, CDK is not great at multi-account state sharing. Or in SST, the only supported relational database is RDS Aurora V1, which is infamous for its slow scalability.

Reliable deployments

  1. Type safety: MINOR benefit
    Typing can provide early detection of type errors in your infrastructure. Just make sure you also create the type definitions and interfaces for your own classes! If you do not do that or if it is not important, what good is it?
    Another problem that is often overlooked is that it is not trivial to share these types with your actual application, since the two might use different runtime versions or bundlers. It IS possible, just make sure you're willing to put in the work.

  2. Good preview: GREAT benefit

    This is where all CloudFormation tools in my opinion are still lacking a bit. Most other tools provide a relatively good overview of what has changed since the last run.

  3. (Continuous) corrective changes: MEDIUM benefit

    Some tools detect drift and attempt to fix it to a desired state when changes are applied. Crossplane can even do this continuously. By detecting changes outside of the tool's state, deployment reliability is improved.

  4. End-to-end infra testing: This would be an AWESOME benefit (:cries-in-custom-scripting:)
    However, no IaC tool currently supports a good test framework. I would love a test framework that integrates with any IaC tool and abstracts common test cases... Just imagine a world where you could do high-level assertions on your infrastructure as code resources directly, like checking if a step function has finished successfully, an SQS message has been sent, or an S3 object has been created...
    Unfortunately, we can only dream about this now :(.
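Until such a framework exists, the custom scripting for end-to-end infra tests often boils down to "poll until something is true, or give up". Below is a minimal sketch of that pattern using only the standard library. The `eventually` helper and the `FakeBucket` class are hypothetical names for this example; in a real test the check would call the AWS SDK instead of a fake.

```python
import time
from typing import Callable


def eventually(check: Callable[[], bool], timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeBucket:
    """Stand-in for a real lookup such as 'does this S3 object exist yet?'."""

    def __init__(self, appears_after_polls: int) -> None:
        self.polls = 0
        self.appears_after_polls = appears_after_polls

    def object_exists(self) -> bool:
        self.polls += 1
        return self.polls >= self.appears_after_polls


# The fake object "appears" on the third poll.
bucket = FakeBucket(appears_after_polls=3)
assert eventually(bucket.object_exists, timeout=5.0, interval=0.01)
print(f"object found after {bucket.polls} polls")
```

The same helper could wrap a check for a finished step function execution or a received SQS message. What is missing today is a framework that ships those checks for you.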

Maintainability

Maintainability is for me the amount of effort involved in making sure development speed and reliability are maintained over time.

  1. Evolvable codebase: AWESOME benefit
    Over time your application grows and changes. This could mean that names become ill-descriptive and files become large and unreadable. Both Terraform and Pulumi do a relatively good job of letting you restructure files with little impact. In CloudFormation-based repositories, restructuring is only possible within the same CloudFormation stack, and logical renames are a no-go.

  2. Modularity (self-managed abstraction): GREAT benefit
    Breaking down the codebase into modular components or functions promotes code reuse, improves readability, and makes it easier to understand and maintain specific parts of the system without affecting others. Terraform provides modules, Crossplane has Composite Resources, Pulumi has its ComponentResource and CDK uses constructs to provide this modularity. One piece of advice: DON'T OVERDO THIS! It's a neat feature, but if you need more than 2 layers of self-managed abstraction for your application, chances are you're doing it wrong.

  3. Little housekeeping: GREAT benefit

    Not having to deal with (major) package upgrades, provider updates, and state management is great.

Scorecard template
You can use this scorecard template to help get an overview. By doing some targeted research per category, you'll get a relatively accurate comparison between different tools.


WORKLOAD DESCRIPTION:

CATEGORY                                  IMPORTANCE (0-5)   TOOL SCORE (0-10)   WEIGHTED SCORE
Managed abstraction                       (3)
Fail & debug fast                         (3)
Application deployment                    (3)
Shallow learning curve                    (3)
Language flexibility                      (1)
Ecosystem integration                     (3)
Purpose-built benefits and limitations    (3)
TOTAL VELOCITY
Type safety                               (0.5)
Preview changes                           (3)
Corrective changes                        (3)
End-to-end infra testing                  (3)
TOTAL RELIABILITY
Evolvable codebase                        (3)
Modularity (self-managed abstraction)     (1.5)
Little housekeeping                       (1.5)
TOTAL MAINTAINABILITY
TOTAL

General recommendations

Choosing multiple tools

It is OK to choose multiple tools for different parts of your application landscape. Purpose-built is the keyword here. The simpler a tool is to learn, the more different tools you can pick. Try not to choose multiple tools with steep learning curves.

Common example scenarios

These would be my general pieces of advice for some common scenarios.

You migrate a workload to AWS. It contains a mix of EC2 and containers, and you do not particularly prefer any ecosystem:

I would advise you to go for Terraform. It has very strong ecosystem benefits and a shallow learning curve. It is verbose (especially IAM policies and security groups), but if you're not heavy on the cloud-native stuff, it shouldn't be a big problem. Since you are migrating, I assume several 'old' processes are still in place that cause infrastructure drift. Overall, Terraform would be a great fit for this use case.

Pro-tip: don't go crazy on Terraform wrapper tools from the start. You can get things to work on a surprisingly large scale without these wrappers. Implement only what you need in the foreseeable future.

Note that Pulumi is great too, but it has a slightly smaller ecosystem available (integrations with common CI/CD tools, static code analysis etc.) and the learning curve is steeper in my experience.

You run Kubernetes in AWS and want to make limited use of cloud-native tooling:
You still can't go wrong with Terraform for infrastructure configuration. Pulumi is also an option, but often I don't need the programming language benefits it provides.

For Kubernetes configuration, you might be better off with a separate tool, as you miss out on development speed due to the limited managed abstractions these tools provide. For infrastructure configuration, Crossplane is 'relatively' new here, so if its unique benefits, like continuous corrections on infrastructure state, solve a big problem for your organization, feel free to choose that one. If its benefits are 'nice to haves', I would recommend technology with a longer track record, like Terraform.

You are developing a cloud-native container solution on AWS:
If you are using ECS, it depends. For mono-repo approaches, CDK is nice because of its bundling capabilities. For split responsibilities (one repo for infra, one for app deployments), I'd go for another tool due to the infrastructural drift caused by application deployments.

You are developing a cloud-native ETL flow on AWS:
If you work with services like Glue, Lambda, S3, Lake Formation, Athena, then CDK is a great tool. Its versatility and abstraction will save you lots of time. Also, infrastructural drift is usually less of an issue in these scenarios.

You are developing a serverless solution on AWS:
Go for the simplest tool that (a) supports your use cases, (b) provides a lot of abstraction and (c) fits well within your ecosystem. For example:

  • A simple small stateless API can easily be done with SAM CLI or the Serverless Framework. If you are using Python extensively, Chalice might even be a better fit for your tooling.

  • A low-volume cloud-native web app is a perfect use case for Serverless Stack (SST). It also works for medium-to-high volume apps if you work around the RDS Aurora V1 scaling.

Conclusion

By using the template scorecard and adjusting the weights, I hope you can make a more informed decision on which infrastructure as code tool to use.