The Complete Guide to Terraform Best Practices
Learn how to fix common Terraform mistakes and implement best practices for infrastructure as code. Complete guide with examples, folder structures, and actionable solutions for DevOps teams.
How to Transform Your Infrastructure Code from Messy to Maintainable
Table of Contents
- Why Most Terraform Codebases Fail
- The 5 Most Critical Terraform Mistakes
- The Complete Terraform Best Practices Guide
- Advanced Terraform Optimization Techniques
- Terraform Security and Compliance
- Common Terraform Migration Strategies
- Frequently Asked Questions
- Conclusion and Next Steps
Why Most Terraform Codebases Fail
After auditing more than 100 Terraform implementations across projects, companies, and start-ups, a disturbing pattern emerges: more than 70% of those organizations struggle with unmaintainable infrastructure code.
These aren’t minor style issues—they’re fundamental problems that cost teams thousands of hours in debugging, increase deployment risks, and make scaling infrastructure nearly impossible.
In my experience, teams spend roughly 40% more time on infrastructure changes when working with poorly organized Terraform code, and around 60% of production incidents in these organizations trace back to infrastructure misconfigurations that proper Terraform practices would have prevented.
This comprehensive guide will show you exactly how to fix these problems and implement industry-standard Terraform best practices that scale.
The 5 Most Critical Terraform Mistakes
1. Monolithic Configuration Files: The 3,000-Line Nightmare
The Problem: Massive configuration files that are impossible to navigate, understand, or maintain.
Real-world example I encountered:
main.tf (3,247 lines)
├── VPC configuration (lines 1-280)
├── EKS cluster setup (lines 281-1,100)
├── RDS databases (lines 1,101-1,800)
├── Load balancers (lines 1,801-2,200)
├── Lambda functions (lines 2,201-2,800)
└── Monitoring resources (lines 2,801-3,247)
Why this destroys productivity:
- Finding specific resources takes 10+ minutes
- Code reviews become impossible
- Multiple developers can’t work simultaneously
- Simple changes require understanding the entire system
- Git conflicts are constant and complex
2. Copy-Paste Infrastructure: The Maintenance Nightmare
The Problem: Duplicating infrastructure code across environments instead of using reusable modules.
What this looks like in practice:
# Typical copy-paste structure
environments/
├── dev/
│ └── main.tf (800 lines of duplicated VPC config)
├── staging/
│ └── main.tf (800 lines of slightly different VPC config)
└── prod/
    └── main.tf (800 lines of mostly similar VPC config)
The hidden costs:
- Security patches require updating 3+ files
- Feature additions multiply development time by environment count
- Configuration drift causes environment-specific bugs
- Testing becomes unreliable due to environment differences
3. Hardcoded Configuration Values: The Scalability Killer
The Problem: Embedding configuration values directly in resource definitions instead of using variables.
Typical bad example:
# This pattern repeated 50+ times across files
resource "aws_instance" "web" {
  instance_type     = "t3.medium"              # Hardcoded
  availability_zone = "us-west-2a"             # Hardcoded
  ami               = "ami-0abcdef1234567890"  # Hardcoded

  tags = {
    Environment = "production" # Hardcoded
    Project     = "webapp"     # Hardcoded
  }
}

resource "aws_db_instance" "main" {
  allocated_storage = 100             # Hardcoded
  instance_class    = "db.t3.medium"  # Hardcoded
  engine_version    = "13.7"          # Hardcoded
}
Impact on operations:
- Environment promotions require manual code changes
- Scaling decisions can’t be made dynamically
- Compliance requirements become impossible to enforce consistently
- Cost optimization requires touching hundreds of files
4. Local State Management: The Collaboration Disaster
The Problem: Using local state files or informal sharing methods for team collaboration.
Common disasters I’ve witnessed:
- Developer laptops holding the only copy of critical infrastructure state
- State files shared via Slack, email, or Dropbox
- Multiple team members running terraform apply simultaneously
- Lost state files causing infrastructure to become “unmanaged”
- State corruption during failed deployments
Business consequences:
- Infrastructure becomes unreproducible
- Disaster recovery becomes impossible
- Audit trails disappear
- Team productivity drops as coordination overhead increases
5. Documentation Debt: The Knowledge Silo Problem
The Problem: Assuming infrastructure code is self-documenting when it requires significant context.
Critical missing information:
- Why specific configurations were chosen over alternatives
- Dependencies between resources and external systems
- Environment-specific requirements and constraints
- Troubleshooting procedures for common failure scenarios
- Cost implications of configuration choices
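The fix doesn't require a wiki: much of this context can live right next to the code. Below is a minimal, intentionally incomplete sketch; the module path, values, and reasoning are made up purely for illustration.
# modules/rds/main.tf (illustrative)
resource "aws_db_instance" "main" {
  # WHY: gp3 chosen over io1 after load testing showed peak IOPS well below the gp3 baseline;
  # revisit if the reporting workload moves onto this instance.
  storage_type      = "gp3"
  instance_class    = var.instance_class
  allocated_storage = var.allocated_storage

  # WHY: 35 days is the AWS maximum and is required by our audit policy.
  backup_retention_period = 35
}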
The Complete Terraform Best Practices Guide
Best Practice #1: Implement Modular Architecture
Transform your monolithic configurations into reusable, testable modules.
Professional module structure:
terraform/
├── modules/
│ ├── vpc/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ ├── versions.tf
│ │ └── README.md
│ ├── eks-cluster/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ ├── versions.tf
│ │ └── README.md
│ ├── rds/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ ├── versions.tf
│ │ └── README.md
│ └── monitoring/
│ ├── main.tf
│ ├── variables.tf
│ ├── outputs.tf
│ ├── versions.tf
│ └── README.md
└── environments/
├── main.tf
├── variables.tf
├── outputs.tf
└── vars/
├── dev.tfvars
├── staging.tfvars
└── prod.tfvars
Example module implementation:
# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = var.enable_dns_hostnames
  enable_dns_support   = var.enable_dns_support

  tags = merge(var.tags, {
    Name = "${var.name_prefix}-vpc"
  })
}

resource "aws_subnet" "private" {
  count = length(var.private_subnets)

  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnets[count.index]
  availability_zone = var.availability_zones[count.index]

  tags = merge(var.tags, {
    Name = "${var.name_prefix}-private-${count.index + 1}"
    Type = "private"
  })
}
# modules/vpc/variables.tf
variable "cidr_block" {
  description = "CIDR block for the VPC"
  type        = string

  validation {
    condition     = can(cidrhost(var.cidr_block, 0))
    error_message = "The cidr_block must be a valid IPv4 CIDR block."
  }
}

variable "name_prefix" {
  description = "Prefix for resource names"
  type        = string

  validation {
    condition     = can(regex("^[a-z0-9-]+$", var.name_prefix))
    error_message = "Name prefix must contain only lowercase letters, numbers, and hyphens."
  }
}
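The module tree above also lists an outputs.tf for each module. Here is a minimal sketch of what the VPC module might expose; the output names are my own choice, wired to the resources defined above:
# modules/vpc/outputs.tf
output "vpc_id" {
  description = "ID of the created VPC"
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "IDs of the private subnets"
  value       = aws_subnet.private[*].id
}
Downstream modules (EKS, RDS) can then consume module.vpc.vpc_id and module.vpc.private_subnet_ids instead of hardcoded IDs.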
Best Practice #2: Master the Single-Codebase Multi-Environment Strategy
Maintaining separate Terraform code for different environments creates technical debt and configuration drift.
The professional approach:
project_root/
├── main.tf # Core infrastructure definitions
├── variables.tf # Variable declarations with validation
├── outputs.tf # Useful outputs for other systems
├── versions.tf # Provider version constraints
├── terraform.tf # Terraform configuration
├── modules/ # Local modules (optional)
└── vars/
├── dev.tfvars # Development variables
├── dev.tfbackend # Development backend config
├── staging.tfvars # Staging variables
├── staging.tfbackend # Staging backend config
├── prod.tfvars # Production variables
└── prod.tfbackend # Production backend config
Environment-specific deployment commands:
# Development environment
terraform init -backend-config=vars/dev.tfbackend
terraform plan -var-file=vars/dev.tfvars -out=dev.tfplan
terraform apply dev.tfplan
# Production environment
terraform init -backend-config=vars/prod.tfbackend
terraform plan -var-file=vars/prod.tfvars -out=prod.tfplan
terraform apply prod.tfplan
Example variable files:
# vars/dev.tfvars
environment = "dev"
instance_type = "t3.micro"
min_capacity = 1
max_capacity = 2
db_instance_class = "db.t3.micro"
backup_retention_days = 7
# vars/prod.tfvars
environment = "prod"
instance_type = "t3.large"
min_capacity = 3
max_capacity = 10
db_instance_class = "db.r5.xlarge"
backup_retention_days = 30
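To connect the variable files to the modules, the root main.tf passes the environment-specific values into the shared modules, so the same code drives every environment. A simplified sketch follows; the module sources and most variable names are assumptions for illustration:
# main.tf (root)
module "vpc" {
  source = "./modules/vpc"

  name_prefix        = "${var.environment}-webapp"
  cidr_block         = var.vpc_cidr
  private_subnets    = var.private_subnets
  availability_zones = var.availability_zones

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

module "database" {
  source = "./modules/rds"

  instance_class        = var.db_instance_class
  backup_retention_days = var.backup_retention_days
}
Switching environments then comes down to which .tfvars and .tfbackend files you pass on the command line.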
Best Practice #3: Implement Comprehensive Variable Management
Replace all hardcoded values with properly validated variables.
Advanced variable examples:
# variables.tf
variable "environment" {
  description = "Environment name (dev, staging, prod)"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be one of: dev, staging, prod."
  }
}

variable "instance_type" {
  description = "EC2 instance type for web servers"
  type        = string
  default     = "t3.medium"

  validation {
    condition     = can(regex("^[tm][0-9]", var.instance_type))
    error_message = "Instance type must be a valid EC2 instance type (t3.*, m5.*, etc.)."
  }
}

variable "allowed_cidr_blocks" {
  description = "List of CIDR blocks allowed to access the application"
  type        = list(string)
  default     = []

  validation {
    condition = alltrue([
      for cidr in var.allowed_cidr_blocks : can(cidrhost(cidr, 0))
    ])
    error_message = "All CIDR blocks must be valid IPv4 CIDR notation."
  }
}

variable "database_config" {
  description = "Database configuration object"
  type = object({
    instance_class        = string
    allocated_storage     = number
    backup_retention_days = number
    multi_az              = bool
  })

  validation {
    condition     = var.database_config.allocated_storage >= 20
    error_message = "Database allocated_storage must be at least 20 GB."
  }

  validation {
    condition     = var.database_config.backup_retention_days >= 1 && var.database_config.backup_retention_days <= 35
    error_message = "Backup retention days must be between 1 and 35."
  }
}
Best Practice #4: Establish Enterprise-Grade State Management
AWS S3 + DynamoDB Setup:
# terraform.tf
terraform {
  required_version = ">= 1.0"

  backend "s3" {
    # These values come from tfbackend files
  }

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
# vars/prod.tfbackend
bucket         = "my-company-terraform-state-prod"
key            = "infrastructure/terraform.tfstate"
region         = "us-west-2"
dynamodb_table = "terraform-state-lock-prod"
encrypt        = true
# Note: versioning is enabled on the S3 bucket itself, not in the backend config
State management best practices:
- Use separate backend configurations for each environment
- Enable versioning on S3 buckets (see the bootstrap sketch below)
- Implement DynamoDB locking to prevent concurrent modifications
- Set up cross-region replication for disaster recovery
- Use IAM policies to restrict state file access
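One practical note: the state bucket and lock table have to exist before terraform init can use them, so they are usually created once in a small bootstrap configuration (or by hand). A hedged sketch, with placeholder names matching the backend file above:
# backend-bootstrap/main.tf (run once, separately from the main configuration)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-company-terraform-state-prod"
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock-prod"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the key name Terraform's S3 backend expects

  attribute {
    name = "LockID"
    type = "S"
  }
}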
Best Practice #5: Implement Consistent Naming Conventions
Recommended naming pattern: {environment}-{application}-{component}-{resource-type}
# Good naming examples
resource "aws_vpc" "main" {
  cidr_block = var.vpc_cidr

  tags = {
    Name        = "${var.environment}-${var.application}-vpc"
    Environment = var.environment
    Application = var.application
    ManagedBy   = "terraform"
  }
}

resource "aws_security_group" "web" {
  name_prefix = "${var.environment}-${var.application}-web-sg"
  vpc_id      = aws_vpc.main.id

  tags = {
    Name        = "${var.environment}-${var.application}-web-sg"
    Environment = var.environment
    Application = var.application
    Component   = "web"
    ManagedBy   = "terraform"
  }
}
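If the "${var.environment}-${var.application}" prefix starts appearing in dozens of places, a local value keeps it consistent and makes renames trivial. An optional refactor of the same security group:
# locals.tf
locals {
  name_prefix = "${var.environment}-${var.application}"
}

resource "aws_security_group" "web" {
  name_prefix = "${local.name_prefix}-web-sg"
  vpc_id      = aws_vpc.main.id

  tags = {
    Name      = "${local.name_prefix}-web-sg"
    Component = "web"
    ManagedBy = "terraform"
  }
}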
Best Practice #6: Create Actionable Outputs
Stop making people grep state files for important information.
# outputs.tf
output "infrastructure_summary" {
  description = "Summary of deployed infrastructure"
  value = {
    environment       = var.environment
    vpc_id            = aws_vpc.main.id
    vpc_cidr          = aws_vpc.main.cidr_block
    public_subnets    = aws_subnet.public[*].id
    private_subnets   = aws_subnet.private[*].id
    database_endpoint = aws_db_instance.main.endpoint
    load_balancer_dns = aws_lb.main.dns_name
  }
}

output "connection_info" {
  description = "Information needed to connect to deployed resources"
  value = {
    application_url = "https://${aws_lb.main.dns_name}"
    bastion_host_ip = aws_instance.bastion.public_ip
    database_port   = aws_db_instance.main.port
  }
  sensitive = true
}

output "monitoring_endpoints" {
  description = "Endpoints for monitoring and observability"
  value = {
    cloudwatch_dashboard = "https://console.aws.amazon.com/cloudwatch/home?region=${var.aws_region}#dashboards:name=${var.environment}-${var.application}"
    log_group            = aws_cloudwatch_log_group.main.name
  }
}
Advanced Terraform Optimization Techniques
Performance Optimization
Use targeted plans and applies:
# Only plan changes to specific resources
terraform plan -target=module.database
# Apply changes to specific modules
terraform apply -target=module.web_servers
Implement parallel resource creation:
resource "aws_instance" "web" {
count = var.instance_count
# Terraform will create these in parallel
ami = var.ami_id
instance_type = var.instance_type
depends_on = [aws_security_group.web]
}
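Terraform already walks the dependency graph and creates independent resources concurrently (10 operations at a time by default). If your provider and API rate limits allow it, the ceiling can be tuned per run:
# Raise the concurrency limit for a single apply
terraform apply -parallelism=20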
Cost Optimization Strategies
Implement lifecycle rules:
resource "aws_instance" "web" {
ami = var.ami_id
instance_type = var.instance_type
lifecycle {
prevent_destroy = true # Prevent accidental deletion
ignore_changes = [
ami, # Allow AMI updates outside Terraform
]
}
}
Use data sources for efficiency:
# Instead of hardcoding AMI IDs
data "aws_ami" "amazon_linux" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["amzn2-ami-hvm-*-x86_64-gp2"]
}
}
resource "aws_instance" "web" {
ami = data.aws_ami.amazon_linux.id
# ... rest of configuration
}
Terraform Security and Compliance
Secrets Management
Never store secrets in Terraform code:
# Bad - secrets in plain text
resource "aws_db_instance" "main" {
  username = "admin"
  password = "super-secret-password" # DON'T DO THIS
}

# Good - use external secret management
resource "aws_db_instance" "main" {
  username                    = var.db_username
  manage_master_user_password = true # Let AWS manage the password

  # Alternative: reference AWS Secrets Manager instead
  # (don't combine this with manage_master_user_password)
  # password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
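The Secrets Manager alternative above assumes a data source along these lines (the secret name is a placeholder):
# Look up a secret that is created and rotated outside Terraform
data "aws_secretsmanager_secret" "db_password" {
  name = "prod/webapp/db-password" # placeholder secret name
}

data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = data.aws_secretsmanager_secret.db_password.id
}
Keep in mind the resolved value still ends up in the state file, which is one more reason to lock down and encrypt state.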
Access Control
Implement least-privilege IAM policies:
data "aws_iam_policy_document" "terraform_state" {
statement {
effect = "Allow"
actions = [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
]
resources = [
"${aws_s3_bucket.terraform_state.arn}/*"
]
condition {
test = "StringEquals"
variable = "s3:x-amz-server-side-encryption"
values = ["AES256"]
}
}
}
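To actually grant that access, the document gets wrapped in a managed policy and attached to whatever identity runs Terraform. A sketch under the assumption that a dedicated CI/CD role exists (aws_iam_role.terraform is hypothetical here):
resource "aws_iam_policy" "terraform_state" {
  name   = "terraform-state-access"
  policy = data.aws_iam_policy_document.terraform_state.json
}

resource "aws_iam_role_policy_attachment" "terraform_state" {
  role       = aws_iam_role.terraform.name # assumed CI/CD role, not defined above
  policy_arn = aws_iam_policy.terraform_state.arn
}
Don't forget the lock table: the same identity also needs dynamodb:GetItem, PutItem, and DeleteItem on the terraform-state-lock table.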
Compliance Automation
Tag resources for compliance:
locals {
  common_tags = {
    Environment  = var.environment
    Application  = var.application
    Owner        = var.team_email
    CostCenter   = var.cost_center
    Compliance   = var.compliance_level
    ManagedBy    = "terraform"
    LastModified = timestamp()
  }
}

resource "aws_instance" "web" {
  # ... configuration
  tags = merge(local.common_tags, {
    Name = "${var.environment}-${var.application}-web-${count.index + 1}"
    Role = "web-server"
  })
}
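On AWS, the provider's default_tags block can apply the common tags automatically, so individual resources only add what is specific to them:
provider "aws" {
  region = var.aws_region

  default_tags {
    tags = local.common_tags
  }
}
One caveat: dynamic values such as timestamp() in common_tags will show up as a change on every plan, so consider setting LastModified only where you genuinely need it.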
Common Terraform Migration Strategies
Migrating from Manual Infrastructure
Step 1: Import existing resources
# Import existing VPC
terraform import aws_vpc.main vpc-123456789
# Import existing security groups
terraform import aws_security_group.web sg-987654321
Step 2: Generate configuration from state
# Use terraform show to see imported resource configuration
terraform show -json | jq '.values.root_module.resources[]'
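On Terraform 1.5 and newer, declarative import blocks can replace the one-off CLI commands:
# import.tf
import {
  to = aws_vpc.main
  id = "vpc-123456789"
}
Terraform can then draft configuration for anything declared in import blocks:
# Have Terraform write a first-pass configuration for imported resources
terraform plan -generate-config-out=generated.tf
Treat the generated file as a starting point to refactor into your modules, not as finished code.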
Migrating from Other Tools
From AWS CloudFormation:
- Export CloudFormation stack resources
- Create equivalent Terraform configuration
- Import resources one by one
- Verify state matches reality
- Delete CloudFormation stack
From Ansible/Chef/Puppet:
- Audit current infrastructure state
- Create Terraform modules for common patterns
- Gradually replace configuration management with Terraform
- Maintain temporary parallel systems during transition
Frequently Asked Questions
Q: How do I handle Terraform state conflicts in a team environment?
A: Implement remote state with locking and establish clear workflows:
# Always check state before making changes
terraform plan
# Use workspaces for feature development
terraform workspace new feature-branch
terraform workspace select feature-branch
# Implement proper CI/CD with state locking
terraform plan -lock-timeout=10m
Q: What’s the best way to manage Terraform versions across environments?
A: Use version constraints and tfenv for version management:
# terraform.tf
terraform {
  required_version = "~> 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
Q: How do I test Terraform modules before deploying to production?
A: Implement a comprehensive testing strategy:
- Static analysis: Use terraform validate and tflint (example commands below)
- Unit testing: Use Terratest for automated testing
- Integration testing: Deploy to temporary environments
- Security scanning: Use tools like Checkov or tfsec
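A minimal check sequence, locally or in CI, might look like this (tflint, tfsec, and Checkov are separate installs):
terraform fmt -check -recursive
terraform validate
tflint
tfsec .
checkov -d .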
Q: Should I use Terraform workspaces or separate directories for environments?
A: Use separate backend configurations (our recommended approach) rather than workspaces for environments. Workspaces are better suited for feature development and temporary deployments.
Q: How do I handle sensitive data in Terraform?
A: Never store sensitive data in Terraform files:
- Use external secret management (AWS Secrets Manager, HashiCorp Vault)
- Reference secrets through data sources
- Mark outputs as sensitive
- Use environment variables for sensitive inputs (see the sketch below)
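For inputs that must never land in version control, a sensitive variable fed from an environment variable is a common pattern:
# variables.tf
variable "db_password" {
  description = "Master password, supplied outside version control"
  type        = string
  sensitive   = true
}

# Shell: Terraform reads TF_VAR_<name> automatically
export TF_VAR_db_password="..."
terraform plan -var-file=vars/prod.tfvars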
Conclusion and Next Steps
Implementing these Terraform best practices will transform your infrastructure code from a maintenance nightmare into a reliable, scalable system. The key is to start small and gradually improve your existing codebase rather than attempting a complete rewrite.
Immediate Action Plan
Week 1: Quick Wins
- Set up remote state with locking
- Create a variables.tf file and eliminate hardcoded values
- Add basic validation to your most critical variables
Week 2-3: Structure Improvements
- Reorganize your main.tf file with clear sections and comments
- Create your first reusable module
- Implement consistent naming conventions
Month 2: Advanced Implementation
- Complete module library for common infrastructure patterns
- Set up comprehensive testing pipeline
- Implement security scanning and compliance automation
Measuring Success
Track these metrics to measure your Terraform improvement:
- Time to deploy: How long does it take to deploy infrastructure changes?
- Error rate: How often do deployments fail due to configuration issues?
- Team velocity: How quickly can new team members contribute to infrastructure?
- Mean time to recovery: How quickly can you fix infrastructure problems?
Additional Resources
- Official Terraform Documentation: terraform.io/docs
- Terraform Best Practices Guide: terraform.io/docs/cloud/guides/recommended-practices
- Community Modules: registry.terraform.io
- Testing Tools: Terratest, Kitchen-Terraform
Ready to transform your infrastructure code? Start with remote state management today—it’s the foundation that makes everything else possible. Your future self (and your team) will thank you.
Have questions about implementing these practices? Drop a comment below or connect with me on LinkedIn for personalized advice on your Terraform journey.
Share this guide: Help other developers avoid common Terraform pitfalls by sharing this comprehensive guide on social media.