Terraform Best Practices: Writing Code That Won’t Make You Cry Later
Hey there, Terraform warriors! Welcome to the ultimate guide on Terraform best practices.
Whether you’re managing a handful of resources or orchestrating enterprise-scale infrastructure, this guide will help you write Terraform code that your future self will actually thank you for.
I’ve spent years building, breaking, and rebuilding Terraform configurations. Today, I’m sharing everything I’ve learned so you can skip the painful parts and jump straight to writing beautiful, maintainable infrastructure code.
Why Most Terraform Projects Turn Into Nightmares
Let’s be honest – we’ve all been there. Your Terraform project starts as a beautiful, simple configuration. Three months later? It’s a sprawling monster that no one dares to touch. Here’s what typically goes wrong:
- Simple changes take days because the code is a tangled mess
- Teams are afraid to deploy because something always breaks
- New developers need weeks just to understand what’s going on
- Production incidents happen due to preventable configuration mistakes
- Module reusability is a myth because everything is hardcoded
But here’s the thing – it doesn’t have to be this way! With the right practices from day one, your Terraform code can be a joy to work with, even years down the line.
The 7 Deadly Sins of Terraform (And How to Avoid Them)
Sin #1: The 3,000-Line Monster File
The Problem: Everything crammed into one massive main.tf file.
Imagine opening a file with 3,000+ lines where VPC configuration blends into database setup, which morphs into Lambda functions. Finding anything becomes a treasure hunt, and God help you if two developers need to work on different parts simultaneously – merge conflicts will become your daily nightmare.
Why This Happens: It usually starts innocently. You begin with a simple VPC and a few EC2 instances. Then you add RDS. Then Lambda. Before you know it, you’re scrolling for minutes just to find that one security group rule.
The Better Way: Think of your infrastructure like a well-organized kitchen. You wouldn’t throw all your utensils, plates, and food in one drawer, would you? Apply the same principle:
- Keep your main.tf under 100 lines – it should orchestrate, not implement
- Create separate files for logical components (networking.tf, compute.tf, database.tf)
- Use modules for reusable patterns
- Follow the single responsibility principle
Golden Rule: If you’re scrolling to find something, it’s time to split the file.
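As a sketch of what “orchestrate, not implement” looks like (module paths and variable names here are illustrative, not from the original), a short main.tf just wires focused modules together:

```hcl
# main.tf – wires the pieces together; the implementation lives in the modules.
module "network" {
  source   = "./modules/network"
  vpc_cidr = var.vpc_cidr
}

module "compute" {
  source     = "./modules/compute"
  subnet_ids = module.network.private_subnet_ids
}

module "database" {
  source     = "./modules/database"
  subnet_ids = module.network.database_subnet_ids
}
```

Each module then owns its own files, so two developers working on compute and database never collide in the same file.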
Sin #2: Copy-Paste Environment Hell
The Problem: Separate folders with duplicated code for each environment.
This approach seems logical at first – “Let’s keep dev separate from prod!” But what happens when you need to add a new security group rule? You update dev, test it, then copy to staging, then to prod. Miss one? Congratulations, your environments are now different, and you won’t know until something breaks.
Why This Is Dangerous:
- Configuration drift is inevitable
- Security patches become a multi-day ordeal
- Testing loses its meaning when environments differ
- “But it works in staging!” becomes your team’s anthem
The Better Way: One codebase, multiple configurations. Think of it like a recipe – you don’t write three different recipes for small, medium, and large portions. You have one recipe and adjust the quantities.
Use variable files (tfvars) to handle environment-specific values while keeping your infrastructure definition DRY (Don’t Repeat Yourself). This ensures that when you fix a bug or add a feature, it’s automatically available across all environments.
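The recipe idea can be sketched like this, assuming an `environments/` directory with one tfvars file per environment (names and values are illustrative):

```hcl
# variables.tf – one definition shared by every environment
variable "instance_type" {
  description = "EC2 instance type for the web tier"
  type        = string
  default     = "t3.small"
}

# environments/dev.tfvars   →  instance_type = "t3.small"
# environments/prod.tfvars  →  instance_type = "t3.large"
```

Then `terraform apply -var-file=environments/prod.tfvars` deploys the same recipe with production-sized quantities.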
Sin #3: Hardcoding Everything
The Problem: Values baked directly into resources like concrete.
When you hardcode values, you’re essentially writing a love letter to technical debt. That t3.medium instance type? What happens when you need t3.large in production? That AMI ID? It’s region-specific and will break the moment you deploy elsewhere.
Why This Kills Scalability:
- Environment promotions require code changes
- Multi-region deployments become impossible
- Cost optimization requires touching every resource
- You can’t share modules between projects
The Better Way: Variables are your friends, but smart variables are your best friends. Don’t just extract values into variables – add validation, descriptions, and sensible defaults. Think about:
- What might change between environments?
- What might change between regions?
- What might change as the application scales?
- What would a new team member need to know?
Use variable validation to catch errors early. If someone tries to deploy a t2.nano for a production database, wouldn’t you rather catch that during plan than at 3 AM when the site is down?
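As a minimal sketch (the variable name and the rejected instance family are illustrative), a validation block that fails the plan for t2-class databases could look like:

```hcl
variable "db_instance_class" {
  description = "RDS instance class for the application database"
  type        = string
  default     = "db.t3.medium"

  validation {
    # Reject the whole t2 family at plan time, long before 3 AM.
    condition     = !can(regex("^db\\.t2\\.", var.db_instance_class))
    error_message = "t2-family instances are not allowed for the database tier."
  }
}
```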
Sin #4: Local State Files (The Collaboration Killer)
The Problem: State files living on developer laptops.
This is like keeping the only copy of your house keys in your pocket while going swimming in the ocean. Local state files are disasters waiting to happen:
- Laptop dies? Your infrastructure is now orphaned
- Two developers run apply simultaneously? State corruption
- Need to rollback? Hope you have that old state file somewhere
- Audit requirements? Good luck explaining your “process”
Why Remote State Is Non-Negotiable: Remote state isn’t just about backup – it’s about collaboration, consistency, and confidence. With remote state:
- State locking prevents concurrent modifications
- Versioning allows rollbacks
- Encryption keeps sensitive data secure
- Team members can collaborate without fear
Implementation Tips:
- Use S3 with DynamoDB for AWS (built-in locking)
- Enable versioning on your state bucket
- Implement proper IAM policies
- Consider separate state files for different components
- Always encrypt state at rest and in transit
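Putting those tips together, an S3 backend configuration might look like this sketch (the bucket and table names are assumptions, not from the original):

```hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"  # assumed: versioned, encrypted bucket
    key            = "networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"       # assumed: table with a LockID hash key
  }
}
```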
Sin #5: The Count Trap
The Problem: Using count for creating multiple resources.
The count parameter seems innocent enough. Need three web servers? Just use count = 3. But here’s the trap: Terraform identifies counted resources by their index. Remove the middle server, and Terraform thinks the third server is now the second, triggering a destroy and recreate.
Real-World Horror Story: A team used count for their microservices. They removed one service from the middle of the list. Result? Half their production services were recreated, causing 30 minutes of downtime.
Why for_each Is Superior:
- Resources are identified by keys, not position
- Add, remove, or reorder without affecting others
- More expressive and self-documenting
- Works beautifully with maps and sets
When to Use What:
- Use count only for truly identical resources (like read replicas)
- Use for_each when resources have any unique properties
- Use for_each when the list might change over time
- Default to for_each when in doubt
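A minimal for_each sketch (service names and sizes are illustrative): removing the worker entry later destroys only that one instance, because each resource is addressed by its key.

```hcl
variable "services" {
  type = map(string)  # service name → instance type
  default = {
    api      = "t3.small"
    worker   = "t3.medium"
    frontend = "t3.small"
  }
}

resource "aws_instance" "service" {
  for_each = var.services

  ami           = var.ami_id  # assumed to be defined elsewhere
  instance_type = each.value

  tags = {
    Name = each.key  # addressed as aws_instance.service["api"], never by position
  }
}
```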
Sin #6: Module Monoliths
The Problem: Creating “god modules” that do everything.
It’s tempting to create one module that handles your entire application stack. VPC, EKS, RDS, ElastiCache, S3, CloudFront – why not put it all together? Because you’ve just created an unmaintainable, inflexible monster that no one can reuse.
Why This Fails:
- Can’t reuse parts of the module
- Testing becomes nearly impossible
- Version updates affect everything
- 147 variables that no one understands
- One size fits nobody
The Module Philosophy: Think of modules like LEGO blocks, not like pre-built castles. Each module should:
- Do one thing well
- Be composable with other modules
- Have a clear interface (inputs/outputs)
- Be testable in isolation
- Be versioned independently
Module Best Practices:
- Separate Repository: Each module gets its own repo for independent versioning
- Semantic Versioning: Use vMajor.Minor.Patch for clear upgrade paths
- Examples Included: Show how to use the module
- Comprehensive Outputs: Expose what others might need
- Optional Variables: Use Terraform’s optional() for backwards compatibility
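A sketch of that last point (field names are illustrative; optional() with a default requires Terraform 1.3+): adding multi_az later doesn’t break existing callers of the module.

```hcl
variable "database" {
  type = object({
    engine         = string
    engine_version = string
    multi_az       = optional(bool, false)  # new field with a safe default
    backup_window  = optional(string)       # null when the caller omits it
  })
}
```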
Sin #7: Zero Documentation
The Problem: “The code is self-documenting!” (Narrator: It wasn’t.)
Six months later, no one knows why the RDS backup window is at 3:47 AM, why there are exactly 7 subnets, or what that weird IAM policy is for. The original developer has left, and now every change is a risky adventure.
What Documentation Should Answer:
- Why did we make this choice? (not what – the code shows that)
- What are the dependencies and prerequisites?
- How does this integrate with other systems?
- When should settings be changed?
- Who should be contacted for questions?
- How much will this cost to run?
Documentation Best Practices:
- Document at the point of decision
- Include cost estimates
- Explain the “why” behind non-obvious choices
- Add links to relevant documentation
- Include recovery procedures
- Document known limitations
Building Production-Ready Terraform Modules
Now that we’ve covered what NOT to do, let’s dive deep into building modules that are actually reusable, maintainable, and production-ready.
The Philosophy of Great Modules
Great modules aren’t just about organizing code – they’re about creating abstractions that make sense for your organization. Think of them as building blocks that encode your best practices, security requirements, and operational knowledge.
Module Design Principles
1. Single Responsibility Each module should have one clear purpose. A VPC module creates networking infrastructure. An RDS module creates databases. Don’t mix concerns.
2. Composability Over Completeness It’s better to have several focused modules that work together than one module that tries to do everything. This allows teams to mix and match based on their needs.
3. Convention Over Configuration Establish conventions and make them defaults. If your organization always uses specific tag names, encryption settings, or network configurations, build these into your modules.
4. Progressive Disclosure Make simple things simple and complex things possible. Use sensible defaults but allow overrides for advanced use cases.
Essential Module Components
Every production-ready module needs:
- Clear Interface – Well-defined inputs and outputs
- Validation – Catch errors early with variable validation
- Documentation – README with examples and explanations
- Testing – Automated tests to ensure functionality
- Versioning – Semantic versioning for safe upgrades
Working with Dynamic Blocks
Dynamic blocks are powerful but can make code harder to read. Use them when you need flexibility, but document their behavior clearly.
When to Use Dynamic Blocks:
- Optional features (like logging or monitoring)
- Variable numbers of similar configurations
- Environment-specific settings
When to Avoid:
- Simple on/off features (use ternary operators instead)
- When it makes the code significantly harder to understand
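As an illustrative sketch of the “optional feature” case (resource and variable names are assumptions), an ALB access-logging block driven by a nullable variable:

```hcl
resource "aws_lb" "this" {
  name               = var.name
  load_balancer_type = "application"
  subnets            = var.subnet_ids

  # Emit the access_logs block only when a bucket was supplied.
  dynamic "access_logs" {
    for_each = var.log_bucket == null ? [] : [var.log_bucket]
    content {
      bucket  = access_logs.value
      enabled = true
    }
  }
}
```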
Leveraging Terraform Functions
Terraform’s built-in functions are powerful tools for creating flexible modules:
- lookup() – Safe map access with defaults
- try() – Graceful handling of optional values
- can() – Validation and conditional logic
- merge() – Combining maps for tag strategies
- optional() – Backwards-compatible variable schemas
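A few of these in combination, as a sketch (the tag values and environment map are assumptions):

```hcl
locals {
  # merge(): organization defaults combined with caller-supplied tags
  tags = merge(
    { ManagedBy = "terraform", Team = "platform" },
    var.extra_tags,
  )

  # lookup(): per-environment sizing with a safe fallback
  instance_type = lookup(
    { dev = "t3.small", prod = "t3.large" },
    var.environment,
    "t3.small",
  )

  # try(): tolerate a nested value that may not exist
  log_bucket = try(var.logging.bucket_name, null)
}
```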
Understanding these functions is crucial for building robust modules that handle edge cases gracefully.
Module Development Workflow
1. Start with the Interface
Before writing any resources, design your module’s interface:
- What inputs does it need?
- What outputs should it provide?
- What are sensible defaults?
2. Build Incrementally
Start with the minimum viable module and add features gradually. This helps maintain simplicity and ensures each addition is necessary.
3. Test Early and Often
Write tests alongside your module:
- Unit tests for logic
- Integration tests for resource creation
- Example configurations for documentation
4. Version Thoughtfully
Use semantic versioning:
- Major: Breaking changes
- Minor: New features (backwards compatible)
- Patch: Bug fixes
5. Document as You Go
Documentation written after the fact is often incomplete. Document decisions when you make them.
Practical Module Patterns
The VPC Module Pattern
A good VPC module should:
- Create consistent network layouts
- Handle multiple availability zones
- Provide flexible CIDR allocation
- Output subnet IDs for other modules
- Include sensible security defaults
The Application Module Pattern
Application modules should:
- Accept VPC information as inputs
- Create all necessary resources for one application
- Handle environment differences through variables
- Include monitoring and alerting
- Provide connection information as outputs
The Data Store Module Pattern
Database modules should:
- Enforce encryption at rest
- Handle backup configurations
- Manage security groups
- Provide connection strings
- Support high availability options
Security Best Practices That Actually Matter
Security in Terraform isn’t just about not hardcoding passwords. It’s about building security into every aspect of your infrastructure code.
Secrets Management Strategy
Never Store Secrets in Code or State
- Use AWS Secrets Manager or HashiCorp Vault
- Generate random passwords within Terraform
- Use IAM roles instead of access keys
- Mark sensitive outputs appropriately
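A minimal sketch of that combination (the secret name is an assumed convention; note the generated value still lands in state, which is one more reason state encryption is non-negotiable):

```hcl
resource "random_password" "db" {
  length  = 32
  special = true
}

resource "aws_secretsmanager_secret" "db" {
  name = "myapp/db-password"  # assumed naming convention
}

resource "aws_secretsmanager_secret_version" "db" {
  secret_id     = aws_secretsmanager_secret.db.id
  secret_string = random_password.db.result
}

output "db_secret_arn" {
  value = aws_secretsmanager_secret.db.arn  # consumers fetch the value at runtime
}
```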
State File Security
Your state file contains sensitive information. Protect it like you would protect your production database:
- Encryption at Rest: Always encrypt state files
- Encryption in Transit: Use HTTPS/TLS for state operations
- Access Control: Limit who can read/write state
- Audit Logging: Track all state access
- Backup Strategy: Regular backups with retention policies
Network Security Patterns
Build security into your modules:
- Default to least privilege
- Use security group rules that reference other security groups
- Implement network segmentation
- Enable VPC Flow Logs
- Use AWS WAF for public-facing applications
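The security-group-to-security-group pattern in sketch form (group names are illustrative):

```hcl
# Allow the app tier to reach Postgres on the database tier without
# opening any CIDR ranges – the rule references the app group directly.
resource "aws_security_group_rule" "db_from_app" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.database.id
  source_security_group_id = aws_security_group.app.id
}
```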
Testing Your Terraform Code
Testing isn’t optional for production infrastructure. Here’s a practical approach:
Static Analysis
Start with the basics:
- terraform fmt – Consistent formatting
- terraform validate – Syntax checking
- tflint – Linting for best practices
- tfsec or checkov – Security scanning
Integration Testing
Test that your modules actually create working infrastructure:
- Use Terratest for automated testing
- Create temporary test environments
- Verify resources are created correctly
- Test connectivity and functionality
- Clean up after tests
Pre-commit Hooks
Catch issues before they enter your repository:
```yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
      - id: terraform_tflint
      - id: terraform_tfsec
```
CI/CD Pipeline Essentials
A good CI/CD pipeline for Terraform should include multiple stages. Let me break down each component:
Stage 1: Validation
First, validate your code syntax and formatting:
```yaml
- name: Terraform Format Check
  run: terraform fmt -check -recursive

- name: Terraform Init
  run: terraform init -backend=false

- name: Terraform Validate
  run: terraform validate
```
This catches basic errors before moving forward.
Stage 2: Security Scanning
Next, scan for security issues:
```yaml
- name: TFSec Security Scan
  uses: aquasecurity/tfsec-action@v1.0.3
  with:
    soft_fail: true

- name: Checkov Scan
  uses: bridgecrewio/checkov-action@master
  with:
    directory: .
```
These tools catch common security misconfigurations.
Stage 3: Planning
Generate and review the plan:
```yaml
- name: Terraform Plan
  run: |
    terraform plan \
      -var-file=environments/${{ matrix.environment }}/terraform.tfvars \
      -out=tfplan \
      -no-color
```
Stage 4: Apply (with Approval)
For production, always require manual approval:
```yaml
- name: Wait for Approval
  uses: trstringer/manual-approval@v1
  with:
    approvers: platform-team
    minimum-approvals: 1

- name: Terraform Apply
  if: github.ref == 'refs/heads/main'
  run: terraform apply tfplan
```
Common Pitfalls to Avoid
- Over-engineering simple infrastructure
- Under-documenting complex decisions
- Ignoring costs until the bill arrives
- Skipping tests to save time (you won’t)
- Not versioning modules properly
- Mixing concerns in modules
- Forgetting about disaster recovery
- Not planning for multi-region
- Hardcoding account IDs or regions
- Using latest for module versions
Real-World Migration Strategies
Migrating from Manual Infrastructure
When importing existing infrastructure:
- Inventory First: Document what exists before importing
- Import Gradually: Start with stateless resources
- Verify Continuously: Run plans after each import
- Add Management Features: Like tagging and monitoring
- Document Differences: Note any manual configurations
Migrating from Other IaC Tools
When moving from CloudFormation, CDK, or other tools:
- Run in Parallel: Keep both systems during migration
- Match Functionality: Ensure feature parity
- Migrate by Service: Don’t try to do everything at once
- Test Thoroughly: Especially data stores and stateful services
- Plan Rollback: Have a clear rollback strategy
What’s Next?
Congratulations! You now have a comprehensive playbook for writing Terraform code that scales. But the journey doesn’t end here.
Keep Learning:
- Join the Terraform community (Discord, Reddit, HashiCorp forums)
- Contribute to open-source Terraform modules
- Share your own modules and learnings
- Stay updated with new Terraform features
Advanced Topics to Explore:
- Terraform Cloud/Enterprise for team collaboration
- Policy as Code with Sentinel or OPA
- GitOps with Flux or ArgoCD
- Cost Management with Infracost
- Multi-cloud strategies and patterns
Remember: Great infrastructure code isn’t about being clever – it’s about being clear, consistent, and maintainable. Start with the basics, improve incrementally, and always prioritize clarity over cleverness.
Happy Terraforming!
Found this guide helpful? Share it with your team and let me know what Terraform challenges you’re facing. I’d love to hear about your infrastructure journey!