25-Week AWS DevOps + MLOps + AIOps Bootcamp
Details
- Fee: 70k INR ($750)
- Total Classes: 50
- Duration: 180 minutes per class
- Format: Live Classes
- Starting On: 18th April 2026
- Classes on: Saturday and Sunday
- Timings: 11 AM - 2 PM IST
- Language: English
This 25-week bootcamp by Akhilesh Mishra focuses on implementing real-world projects with live troubleshooting and production-level detail across Cloud DevOps, MLOps, AIOps, and DevSecOps to make you stand out in a tough job market.
- You will learn by building real-world projects, troubleshooting live issues, and navigating live production incidents. For the next 6 months, you will live the experience of a DevOps Engineer.
- This bootcamp will teach you how to use AI in every workflow to become more productive.
- Whether you are transitioning into DevOps from a different domain or improving your existing DevOps skills, this bootcamp helps with both.
Pre-requisites: Basic Linux and AWS knowledge (I will provide you with resources to go through before the class)
Bootcamp Details
Module: AWS + Containers
Week 1: AWS Fundamentals + running a 2-tier app on EC2
Class 1: Cloud Computing + AWS + Custom VPC + EC2
- History of computing — physical → virtualisation → cloud
- Cloud models — IaaS, PaaS, SaaS, AWS global infrastructure, shared responsibility
- VPC, subnets (public/private), Internet Gateway, route tables
- Security Groups vs NACLs — stateful vs stateless, when to use each
- NAT Gateway — private subnet internet access without exposure
- Bastion host pattern — why you never SSH directly to private instances
- EC2 instance types, AMIs, purchase options (On-Demand, Spot, Reserved)
Project: Production VPC from Scratch
- 6 subnets (2 public, 2 private app, 2 private DB) across 2 AZs
- NAT Gateway, S3 Gateway endpoint, route tables for each tier
- Security Groups per tier with least-privilege rules, NACLs on DB subnets
- Bastion host — SSH to private instance through it, direct access blocked
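The six-subnet layout can be planned on paper before anything is created in AWS. A minimal sketch using Python's standard `ipaddress` module follows; the VPC CIDR `10.0.0.0/16`, the `/24` subnet size, the tier names, and the AZ names are assumptions chosen for illustration, not values prescribed by the course.

```python
import ipaddress

def plan_subnets(vpc_cidr, tiers, azs, new_prefix=24):
    """Assign one subnet per (tier, AZ) pair from consecutive blocks of the VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of equally sized blocks
    plan = {}
    for tier in tiers:
        for az in azs:
            plan[(tier, az)] = next(blocks)
    return plan

# Hypothetical values for illustration only.
plan = plan_subnets(
    "10.0.0.0/16",
    tiers=["public", "private-app", "private-db"],
    azs=["ap-south-1a", "ap-south-1b"],
)
for (tier, az), cidr in plan.items():
    print(f"{tier:12} {az}  {cidr}")
```

Allocating from one generator guarantees the blocks never overlap, which is the property route tables and NACLs per tier depend on.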
Class 2: Running a Flask app on EC2 + nginx reverse proxy + Elastic IP + DNS + Free SSL
- Install PostgreSQL on EC2
- Run a 2-tier app on EC2
- Elastic IP, Route 53 hosted zones, A records, alias records
- ACM free SSL cert — DNS validation, attaching to resources
- Using AI copilots for faster workflow and troubleshooting
Project: Web App on EC2 with Custom Network, DNS + HTTPS
- Running a 2-tier app on EC2 with a production-grade setup
- Elastic IP, Route 53 A record, HTTPS
- nginx for reverse proxy, free SSL certs from Let’s Encrypt — HTTPS end-to-end
Week 2: Database migration on RDS + App with Load Balancing & Auto Scaling with production best practices + S3
Class 3: Running scalable app on EC2 with ALB + Auto Scaling Groups
- ALB vs NLB — when to use which, listeners, rules, target groups, health checks
- ALB path-based routing, HTTPS termination, HTTP → HTTPS redirect
- Launch Templates, ASG — desired/min/max, AZ balancing
- Scaling policies — target tracking, step scaling, scheduled
- User data in ASG — pulling config from SSM on launch
- Connection draining, instance refresh for zero-downtime AMI updates
Project 1: Migrating a PostgreSQL database running on an EC2 to RDS
Project: 2-Tier App with ALB + ASG + RDS + Route 53 + ACM
- Launch Template with User data, ASG across 2 private subnets, min 2, max 6
- ALB with HTTPS, health checks, and target tracking on CPU
- RDS PostgreSQL in private DB subnets, credentials in Secrets Manager
- Database encryption at rest with customer-managed KMS key
- Simulate — terminate instance, scale under load, verify zero downtime
Class 4: S3 + CloudFront + Cross-Region Replication
- S3 storage classes, lifecycle policies, versioning, bucket policies
- Static website hosting, pre-signed URLs, S3 event notifications
- CloudFront — CDN, Origin Access Control, cache invalidation
- S3 replication — SRR vs CRR, IAM roles, delete marker replication
- VPC Endpoints — S3 Gateway endpoint, why it saves cost and improves security
- IAM Instance Profile for S3 access, SSH only via Bastion
Project 1: Static App on S3 + CloudFront + Custom Domain + SSL
- S3 with OAC, CloudFront distribution, ACM cert in us-east-1, Route 53 alias record
Project 2: Real-Time Cross-Region Replication
- Source bucket ap-south-1 → destination us-east-1, versioning on both, lifecycle policy on destination
- Privately accessing S3 objects from a private EC2 VM
Week 3: RDS production patterns + IAM + Disaster Recovery
Class 5: RDS Patterns + Disaster Recovery
- RDS Multi-AZ — synchronous replication, automatic failover
- Read Replicas — async replication, read scaling, cross-region DR
- Multi-AZ vs Read Replica — the difference most engineers get wrong
- RDS snapshots, point-in-time recovery, Aurora basics, RDS Proxy
- DR patterns — Backup & Restore, Pilot Light, Warm Standby, Multi-Site Active-Active
- RTO vs RPO — how to calculate, how to pick the right strategy
- RDS Proxy for connection pooling
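As a rough illustration of the RTO vs RPO calculation for a Backup & Restore strategy: worst-case RPO is the gap between backups, and RTO is the sum of the recovery steps. The sketch below is a simplification under those assumptions; the function names and all the numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RecoveryEstimate:
    rpo_minutes: float  # worst-case window of lost data
    rto_minutes: float  # worst-case time until service is restored

def estimate_backup_restore(backup_interval_min, detect_min, restore_min, dns_cutover_min):
    """Worst-case RPO is the backup interval; RTO is the sum of the recovery steps."""
    return RecoveryEstimate(
        rpo_minutes=backup_interval_min,
        rto_minutes=detect_min + restore_min + dns_cutover_min,
    )

def meets_targets(estimate, target_rpo_min, target_rto_min):
    """True only if both the data-loss and downtime targets are satisfied."""
    return (estimate.rpo_minutes <= target_rpo_min
            and estimate.rto_minutes <= target_rto_min)

# Hypothetical numbers: nightly snapshots, 10 min to detect,
# 45 min to restore, 5 min for the DNS change to take effect.
nightly = estimate_backup_restore(24 * 60, detect_min=10, restore_min=45, dns_cutover_min=5)
print(nightly)
print(meets_targets(nightly, target_rpo_min=60, target_rto_min=120))
```

With these numbers the RTO target is met but the RPO target is not, which is the kind of result that pushes a design from Backup & Restore toward Pilot Light or Warm Standby.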
Project: RDS Multi-AZ Failover Simulation + Cross-Region Read Replica + DR
- Trigger manual failover, measure downtime, verify app reconnects automatically
- Promote Read Replica in the second region, update Route 53
- Disaster recovery from snapshots
Class 6: IAM Deep Dive + Production Security Hardening
- IAM users, groups, roles, policies — why roles always beat users for apps
- Policy structure — Effect, Action, Resource, Condition
- Permission boundaries, cross-account role assumption, trust policies
- SSM Session Manager — modern zero-trust alternative to Bastion/SSH
- Secrets Manager vs SSM Parameter Store — when to use each
- CloudTrail, AWS Config, IAM Access Analyzer — audit and compliance
- IAM Identity Center — federated login for both accounts, no IAM users for humans
- IAM role in prod, trust policy for dev account, developer assumes role via CLI
Project: Lock Down the Stack with IAM + Session Manager
- Remove SSH, enable SSM Session Manager, and least-privilege Instance Profile
- Configure third-party bucket access only from certain IP addresses
- Troubleshooting unauthorized bucket access and locking it down for only authorized access
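To make the Effect / Action / Resource / Condition structure concrete, here is a minimal sketch that builds an IP-restricted bucket policy as a Python dictionary. The bucket name and CIDR range are placeholders, and the Deny-with-`NotIpAddress` shape is one common pattern rather than the only way to do it.

```python
import json

def ip_restricted_bucket_policy(bucket, allowed_cidrs):
    """Deny every S3 action on the bucket unless the request comes from an allowed CIDR."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideAllowedIPs",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # the bucket itself
                    f"arn:aws:s3:::{bucket}/*",  # every object in it
                ],
                "Condition": {"NotIpAddress": {"aws:SourceIp": allowed_cidrs}},
            }
        ],
    }

# Placeholder bucket name and documentation-range CIDR, for illustration only.
policy = ip_restricted_bucket_policy("example-bucket", ["203.0.113.0/24"])
print(json.dumps(policy, indent=2))
```

Because this is an explicit Deny, it should be tested carefully before use: requests that arrive through a VPC endpoint do not carry a public source IP, so a policy of this shape is likely to block them as well.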
Week 4: Containerization using Docker and Docker Compose
- Evolution of Docker, its architecture, and everyday commands
- Building, sharing, and running custom Docker images
- Container networking, volume management with meaningful use cases
- Docker build optimization with intelligent layering, multi-stage builds, and security best practices
- Docker Compose for multi-container applications, app dependencies, and health checks
- Simulating real-time applications, load testing, and alerting
- Migrating the app running on a VM to containers
- Using an AI-based CLI for automated troubleshooting and simple tasks
Project
- Migrating an app from a VM to containers: containerize a Flask + PostgreSQL app running on EC2, write the Dockerfile from scratch, shrink the image with multi-stage builds, and run it locally with Docker Compose mirroring the production setup.
- Optimising the Docker image to reduce size and speed up builds
- Real-time container monitoring with a dashboard, an alert system with AWS SES email notifications, and stress testing for different load scenarios
Module: Running Containers in Production (AWS ECS) + Infrastructure as Code (Terraform) + CI/CD (GitHub Actions)
Week 5: ECS + Terraform Foundations
Class 1: Two-Tier App on ECS (Console)
- Migrating a Dockerised app running on EC2 to ECS, planning, design and execution
- ECS fundamentals — clusters, services, task definitions, Fargate vs EC2 launch type
- Task definition anatomy — container definitions, CPU/memory, environment variables, secrets
- ECS service — desired count, deployment config, circuit breaker
- ALB integration with ECS — target group, dynamic port mapping, health checks
- ECS service discovery — AWS Cloud Map, how containers find each other
- CloudWatch Container Insights — logs, metrics, container-level visibility
- ECS IAM — task role vs task execution role, what each one does
- Secrets Manager + ECS — injecting secrets into containers without hardcoding
Project: Deploy a Database-Backed 2-Tier App on ECS
- VM App Migration to AWS ECS: Migrating a containerised app running on EC2 to ECS Fargate — move secrets to Secrets Manager, upload files to S3, and cut over traffic from EC2 to ECS using Route 53 weighted routing with zero downtime.
- ECS Fargate cluster, task definition for Flask app connecting to RDS PostgreSQL
- ALB with HTTPS, health check on /health, CloudWatch log group per container
- Secrets Manager for DB credentials injected at runtime, task role for S3 access
Class 2: Terraform Fundamentals
- Terraform architecture — providers, resources, state, plan, apply lifecycle
- Writing your first resources — VPC, EC2, S3 in Terraform
- Variables, locals, outputs — making configs reusable
- terraform.tfvars and environment-specific variable files
- Remote state — S3 backend, why local state breaks teams
- Terraform workspaces — managing dev and prod from the same codebase
- terraform import — bringing existing resources under Terraform management
- Terraform plan best practices — what to review before every apply
- Common mistakes — hardcoded values, missing state locking, giant single files
Project: Convert Manual AWS Setup to Terraform
- Recreate the Week 2 VPC (subnets, NAT, security groups) entirely in Terraform
- Remote state in S3 with DynamoDB lock, workspace for dev and prod
- terraform import the manually created RDS from Week 3 into the state
Week 6: ECS with Terraform + Multi-Environment
Class 3: Full ECS Infrastructure with Terraform
- Terraform module structure for ECS — VPC, ALB, ECS, RDS as separate modules
- ECS service and task definition in Terraform — every config option explained
- ALB in Terraform — listeners, rules, target groups, ACM cert attachment
- RDS in Terraform — subnet groups, parameter groups, Multi-AZ toggle per environment
- CloudWatch log groups, metric alarms, and dashboards in Terraform
- Terraform for_each and count — creating multiple similar resources cleanly
- depends_on and resource ordering — avoiding race conditions on apply
- Terraform module versioning — pinning modules for stability
Project: Dev and Production ECS Environments with Terraform
- Single module codebase, two workspaces — dev (single AZ, smaller instances) and prod (Multi-AZ, larger)
- Full stack — VPC, ALB, ECS Fargate, RDS, ACM, CloudWatch — all in Terraform
- terraform plan output reviewed and applied cleanly for both environments
Class 4: Git, GitHub + CI/CD Fundamentals
- Git fundamentals — commits, branches, merge vs rebase, resolving conflicts
- Branching strategies — Gitflow vs trunk-based, what real teams actually use
- GitHub Actions architecture — workflows, jobs, steps, runners
- Triggers — push, pull_request, workflow_dispatch, schedule
- Matrix build testing across Node 18 and Node 20
- GitHub Actions authentication with AWS
- Secrets and environment variables in GitHub Actions — repo secrets vs environment secrets
- GitHub Environments — approval gates before deploying to production
- GitHub Actions matrix builds — testing across multiple versions in parallel
- SDLC and Jira — how tickets flow from backlog to deployed feature in real teams
Project: Multi-Stage CI Pipeline with Automated Testing
- GitHub Actions pipeline — lint → unit test → build Docker image → push to ECR
- Pull request check — pipeline must pass before merge is allowed
Week 7: Automated Deployments + Scaling + Monitoring
Class 5: Automated ECS Deployments with GitHub Actions
- Container image versioning — git SHA tagging, semantic versioning, latest anti-pattern
- ECR lifecycle policies — cleaning up old images automatically
- ECS deployment strategies — rolling update, blue-green via CodeDeploy
- GitHub Actions ECS deploy — aws-actions/amazon-ecs-deploy-task-definition
- Environment-specific workflows — dev deploys on merge to main, prod requires approval
- Rollback strategy — redeploying the previous task definition on failure
- Deployment notifications — Slack alerts on success and failure
- Testing in CI — running integration tests against a staging ECS environment before prod
Project: Automated Build and Deployment Pipeline for ECS
- GitHub Actions builds an image on every merge, tags with the git SHA, and pushes to ECR
- Automatic rollback if health checks fail within 5 minutes
- Email notification on deployment success, failure, and rollback
Class 6: ECS Auto Scaling + Load Testing + Monitoring
- ECS service auto scaling — Application Auto Scaling, target tracking on CPU and ALB request count
- ECS task-level scaling vs service-level scaling — understanding the difference
- CloudWatch custom metrics — pushing app-level metrics from containers
- CloudWatch dashboards — ECS service health, ALB latency, RDS connections in one view
- CloudWatch alarms — composite alarms, alarm actions (SNS → email/Slack)
- AWS X-Ray for distributed tracing on ECS — enabling, reading traces
- Load testing with k6 or hey — simulating real traffic, finding bottlenecks
- Reading CloudWatch Container Insights under load — what to look for
Project: Load Test + Auto Scaling + Monitoring Dashboard
- Target tracking policy — scale ECS tasks when ALB request count per target exceeds threshold
- Run k6 load test, watch tasks scale out, verify ALB distributes traffic
- CloudWatch dashboard showing ECS CPU, ALB 5xx rate, RDS connection count, p99 latency
- Composite alarm — fires Slack alert when both high CPU and high error rate occur together
Week 8: Advanced Terraform + Security + OIDC
Class 7: Three-Tier App with Advanced Terraform
- Advanced Terraform modules — public registry vs private, module composition patterns
- Data sources — referencing existing resources without hardcoding ARNs
- Multi-environment strategy — dev, staging, prod with shared modules and separate state files
- CloudFront in Terraform — distribution, origins, cache behaviours, OAC for S3
- RDS in Terraform — automated backups, snapshot retention, Multi-AZ for prod
- Disaster recovery in Terraform — cross-region Read Replica, automated snapshot copy
- Terraform lifecycle blocks — prevent_destroy, create_before_destroy, ignore_changes
- Terraform drift detection — terraform plan in CI to catch manual changes
Project: Three-Tier App with DR Strategy in Terraform
- Full three-tier stack — CloudFront → ALB → ECS → RDS across dev, staging, prod
- Cross-region RDS Read Replica in Terraform, automated snapshot copy to a second region
- CloudFront + S3 for static assets, ECS for API, RDS for data — all managed in Terraform
- prevent_destroy on RDS and S3, create_before_destroy on ECS task definitions
Class 8: OIDC + Keyless Authentication + Security Hardening + Reusable GitHub Actions Workflows
- Why access keys in CI/CD are dangerous — rotation burden, leak risk, audit gaps
- OIDC fundamentals — how GitHub proves its identity to AWS without a password
- Setting up OIDC provider in AWS IAM — thumbprint, audience, provider URL
- Keyless Terraform in CI — aws-actions/configure-aws-credentials with OIDC
- Fine-grained OIDC conditions — locking roles to specific repos, branches, environments
- Separate IAM roles per environment — dev pipeline cannot touch prod resources
- Permission boundaries on CI roles — hard ceiling on what any pipeline can do
- Full end-to-end deploy — code push → OIDC auth → Terraform plan → ECS deploy — zero static credentials anywhere
Project:
- Writing reusable workflows — workflow_call, composite actions
- Setting up OIDC to avoid long-lived credentials for CI/CD (GitHub Actions) workflows
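The fine-grained OIDC conditions covered in this class live in the IAM role's trust policy. A minimal sketch of such a trust policy, built as a Python dictionary, is shown here; the account ID, repository, and branch are placeholders, and real setups often use `StringLike` with wildcards or environment-based `sub` claims instead.

```python
import json

OIDC_HOST = "token.actions.githubusercontent.com"

def github_oidc_trust_policy(account_id, repo, branch):
    """Trust policy letting one repo/branch assume the role via GitHub's OIDC token."""
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{OIDC_HOST}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Audience used by aws-actions/configure-aws-credentials.
                        f"{OIDC_HOST}:aud": "sts.amazonaws.com",
                        # Lock the role to a single repository and branch.
                        f"{OIDC_HOST}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }

# Placeholder account ID and repository, for illustration only.
trust = github_oidc_trust_policy("123456789012", "example-org/example-repo", "main")
print(json.dumps(trust, indent=2))
```

Generating one policy per environment with a different `repo` and `branch` is one way to ensure a dev pipeline cannot assume the prod role.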
Module: Python for DevOps
Week 9: AWS Lambda + Serverless Automation
Lambda Fundamentals + Event-Driven Architecture
- Deep Dive + API Automation
- requests module — GET, POST, PUT, DELETE, auth, retries with backoff
- boto3 architecture — sessions, clients vs resources, regions, profiles
- Paginating through AWS APIs with get_paginator
- Python logging best practices — levels, formatters, rotating file handlers
- JSON parsing and validation for API responses
- Environment-based config management — .env, os.environ, secrets handling
- Lambda execution model — cold starts, warm starts, concurrency limits
- Function anatomy — handler, event object, context object
- IAM roles and least-privilege security for Lambda
- Triggers — S3, SQS, SNS, EventBridge (cron and event-based)
- Environment variables and secrets management in Lambda
- Lambda deployment — zip packaging, console vs CLI vs Terraform
- Dead letter queues for failed invocation handling
- CloudWatch Logs integration and structured logging from Lambda
- Lambda Layers — packaging dependencies, sharing code across functions
- Reserved and provisioned concurrency — when and why
- Terraform for Lambda — packaging, deploying, versioning, IAM, triggers all in code
- Multi-Lambda orchestration patterns — chaining, fan-out, fan-in
- API Gateway + Lambda integration — proxy vs non-proxy, request/response mapping
- Lambda testing locally with python-lambda-local and mocked events
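"Retries with backoff" from the topic list above can be sketched without any HTTP library. The helper below is a generic illustration (its name and defaults are assumptions, not part of `requests` or boto3): it waits up to `base_delay * 2**attempt` seconds between attempts, capped at `max_delay`, with random jitter.

```python
import random
import time

def retry_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=8.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a retryable error wait with capped exponential backoff and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))  # jitter spreads out simultaneous retries

# Demo: a hypothetical call that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

delays = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky, retry_on=(ConnectionError,), sleep=delays.append)
print(result, calls["n"], len(delays))
```

Passing `sleep` as a parameter keeps the helper easy to unit test, since the test can capture the delays rather than wait for them.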
Projects
- A Lambda function that runs AWS IAM security checks on a schedule
- An S3- and SQS-triggered Lambda workflow for automating file processing
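For the S3-triggered part of the project, a handler mostly needs to pull the bucket and key out of the event object. A minimal sketch follows, tested with a mocked event; the bucket name, key, and return shape are illustrative assumptions.

```python
import json
from urllib.parse import unquote_plus

def handler(event, context):
    """Lambda entry point: list the objects named in an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append({"bucket": bucket, "key": key})
    print(json.dumps({"processed": objects}))  # stdout is captured by CloudWatch Logs
    return {"processed": len(objects), "objects": objects}

# Mocked event with a hypothetical bucket and key, trimmed to the fields used above.
mock_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-landing-bucket"},
                "object": {"key": "uploads/my+report.csv"}}}
    ]
}
result = handler(mock_event, None)
```

When the notification is delivered through SQS instead, each SQS record carries the S3 event as a JSON string in its `body`, so the handler would parse that string first and then apply the same loop.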
Week 10: Production Security + Compliance Automation
Multilevel Image Processing Pipeline — Production Deep Dive
- Designing multi-Lambda architectures for real workloads
- S3 event triggers chained across multiple Lambda functions
- Error handling between Lambda stages — retries, DLQ, alerting
- Lambda concurrency management for high-throughput pipelines
- Monitoring pipeline health with CloudWatch metrics and alarms
- Cost optimization — right-sizing memory, reducing cold starts
Project: Production image processing API — client uploads image → S3 trigger → validate format/size → transform (resize, watermark, convert format) → store to clean bucket → SES notification with processed image link. Full error handling, DLQ for failed images, CloudWatch dashboard
ClamAV File Scanning Automation for S3 Security
- subprocess for running ClamAV scans from Python
- ClamAV setup, freshclam for virus DB updates, scan result parsing (return codes)
- S3 event notification → SQS → Python consumer pattern
- Downloading files from the landing bucket, scanning locally, and routing based on the result
- S3 object tagging — Clean/Infected with put_object_tagging
- Multi-account AWS architecture — landing account vs clean account
- SES email alerts for infected files with file metadata
- Production error handling — what happens if ClamAV crashes, S3 download fails, SQS message is malformed
Project: Banking Compliance File Scanner — S3 upload triggers SQS message → Python consumer downloads file → ClamAV scans → tags object Clean/Infected → routes clean files to processing bucket → blocks infected files → SES alert to security team with filename, bucket, timestamp, and scan output. Full logging, retry logic, and dead letter queue for failed scans
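The scan step of this pipeline comes down to running `clamscan` and interpreting its exit code: per the clamscan man page, 0 means no virus found, 1 means a virus was found, and 2 means an error occurred. A minimal sketch, with hypothetical function names and the tag values taken from the project description:

```python
import subprocess

def classify_scan(returncode):
    """Map a clamscan exit code to the tag value used on the S3 object."""
    if returncode == 0:
        return "Clean"
    if returncode == 1:
        return "Infected"
    return "Error"  # 2 or anything unexpected: never treat as clean

def scan_file(path, timeout=300):
    """Run clamscan on a local file; returns (status, scanner output)."""
    try:
        proc = subprocess.run(
            ["clamscan", "--no-summary", path],
            capture_output=True, text=True, timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # clamscan missing or hung: report an error instead of crashing the consumer
        return "Error", str(exc)
    return classify_scan(proc.returncode), proc.stdout.strip()

print(classify_scan(0), classify_scan(1), classify_scan(2))
```

Treating every non-zero, non-one code as `Error` means a crashed scanner sends the file to retry or the dead letter queue rather than letting it through as clean.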
Week 11: FinOps Automation + RDS Migration with Full Infrastructure
RDS Cost Analysis + Migration Planning Automation
- RDS cost breakdown — instance type, storage, IOPS, multi-AZ, snapshots
- boto3 for pulling RDS metrics — CPU, connections, storage utilization via CloudWatch
- Identifying migration candidates — underutilized instances, oversized storage
- pg_dump and pg_restore from Python subprocess — full and schema-only dumps
- pgsync for live data migration with minimal downtime
- Data validation post-migration — row counts, checksum comparison, schema diff
- Migration rollback strategy — when to cut back, how to keep source alive
- Python script for pre-migration health check and post-migration validation report
- Containerizing the migration script with Docker — Dockerfile, entrypoint, env vars
- Deploying migration job on AWS ECS (Fargate task, not a long-running service)
- Lambda trigger for the ECS task — one-click or scheduled migration kick-off
- Terraform for the complete stack — ECS cluster, task definition, IAM roles, Lambda, EventBridge, VPC networking, security groups, RDS parameter groups
- pgsync live migration — running inside ECS container with source and target RDS connections
- Post-migration validation Lambda — runs row count checks, sends final report via SES
- Live demo — trigger migration from Lambda, watch ECS task run, validate data, confirm zero data loss
Project: End-to-end RDS Migration Platform — full Terraform-provisioned infrastructure, ECS Fargate runs the migration container with pgsync, Lambda triggers and monitors the job, CloudWatch dashboard shows live progress, SNS/SES sends status updates at each stage, post-migration validation script confirms data integrity. Production-ready, reusable for any PostgreSQL migration on AWS
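The post-migration validation idea (row counts plus checksum comparison) can be sketched independently of any database driver. In the example below, in-memory lists of row tuples stand in for query results from the source and target databases; the function names are hypothetical, and on large tables the checksums would normally be computed inside the database.

```python
import hashlib

def table_checksum(rows):
    """Order-independent checksum: hash each row, then hash the sorted row hashes."""
    row_hashes = sorted(
        hashlib.sha256(repr(tuple(row)).encode()).hexdigest() for row in rows
    )
    return hashlib.sha256("".join(row_hashes).encode()).hexdigest()

def validate_migration(source, target):
    """source/target map table name -> list of row tuples. Returns a list of problems."""
    problems = []
    for table in sorted(set(source) | set(target)):
        if table not in target:
            problems.append(f"{table}: missing in target")
        elif table not in source:
            problems.append(f"{table}: unexpected table in target")
        elif len(source[table]) != len(target[table]):
            problems.append(
                f"{table}: row count {len(source[table])} != {len(target[table])}"
            )
        elif table_checksum(source[table]) != table_checksum(target[table]):
            problems.append(f"{table}: checksum mismatch")
    return problems

# Stand-in data for illustration; the target returns rows in a different order.
source = {"orders": [(1, "a"), (2, "b")], "users": [(1, "x")]}
target = {"orders": [(2, "b"), (1, "a")], "users": [(1, "x")]}
print(validate_migration(source, target))
```

An empty list means every table matched; anything else is a candidate reason to hold the cutover or roll back.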
Kubernetes Module — Weeks 12–17
Week 12: Kubernetes Fundamentals + Local Clusters
Kubernetes Architecture + Core Concepts
- The why behind Kubernetes — what broke before it existed
- Control plane deep dive — API server, etcd, scheduler, controller manager
- Worker node components — kubelet, kube-proxy, container runtime
- Core objects — Pod, ReplicaSet, Deployment, Service
- Setting up Minikube locally, kubectl basics and everyday commands
- YAML manifests in depth — apiVersion, kind, metadata, spec
- ConfigMaps and Secrets — creating, mounting as env vars and volumes
- Namespaces and resource organisation
- Labels, selectors, annotations — how Kubernetes finds things
- Resource requests and limits — why they matter in production
- Kubernetes DNS and service discovery internals
- ImagePullSecrets for private registries
- Lens (Freelens) — Kubernetes IDE for visual cluster management
Project: Deploy a 2-tier e-commerce app (frontend + PostgreSQL) on Minikube — wired together with Services, ConfigMaps, Secrets, private image registry
Resilience Patterns, Autoscaling + Live Debugging
- Liveness, readiness, and startup probes — real failure scenarios
- Rolling upgrades and rollback strategies
- HPA and VPA — pod autoscaling based on CPU/memory/custom metrics
- Init containers and sidecar patterns
- Pod Disruption Budgets for zero-downtime deployments
- Deployment strategies — Recreate vs RollingUpdate vs Blue-Green
- CrashLoopBackOff, OOMKilled — live debugging techniques
- Resource quotas and LimitRanges per namespace
- Reading Kubernetes events to diagnose failures fast
- StatefulSets intro — ordered deployment, stable network identity
- DaemonSets and Jobs — when to use each
- Persistent Volumes, PVCs, StorageClass — concepts and local demo
Project: Add HPA, probes, and PodDisruptionBudget to the e-commerce app. Simulate CrashLoopBackOff and OOMKilled failures live and debug them. Add a PostgreSQL StatefulSet with persistent local storage
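When adding HPA it helps to know how the replica count is derived. The Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default). The sketch below is a simplified model of that rule only; it ignores stabilization windows, pod readiness, and missing metrics, and the function name is an assumption.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA rule: scale proportionally to metric/target, within bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# 3 pods averaging 90% CPU against a 60% target
print(desired_replicas(3, current_metric=90, target_metric=60))
# 4 pods averaging 63% CPU against a 60% target (inside the tolerance band)
print(desired_replicas(4, current_metric=63, target_metric=60))
```

The first call scales out to 5 pods and the second leaves the count at 4, which is why small fluctuations around the target do not cause constant resizing.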
Week 13: CI/CD, GitOps + Production EKS Foundation
GitOps with ArgoCD + CI/CD Pipeline on Minikube
- GitOps fundamentals — why GitOps over push-based deployments
- ArgoCD setup on Minikube — apps, sync policies, health checks
- End-to-end CI/CD pipeline — GitHub Actions builds image, ArgoCD deploys
- ArgoCD app-of-apps pattern intro
- Branching strategy for GitOps — app repo vs config repo separation
- Rollback with ArgoCD — one-click vs automated
- Basic Prometheus + Grafana on Minikube — request rate, pod health dashboards
- Kubernetes events and alerts with AlertManager basics
- Debugging failed ArgoCD syncs — common causes and fixes
- Multi-environment GitOps intro — dev vs prod namespaces on same cluster
Project: GitHub Actions pipeline builds and pushes e-commerce image on every commit, ArgoCD auto-deploys to Minikube, basic Grafana dashboard showing pod health and request rate, rollback demonstrated live
Production EKS Setup + Networking + Security Foundations
- EKS cluster setup via eksctl and AWS console
- EKS add-ons — VPC CNI, CoreDNS, EBS CSI Driver, kube-proxy
- Helm — writing, packaging, deploying charts, values management
- IRSA — Kubernetes to AWS IAM with OIDC, no hardcoded credentials
- AWS Load Balancer Controller with Helm — architecture and annotations
- Ingress for internal and external traffic routing
- ExternalDNS for automatic Route53 record management
- Domain, SSL/TLS termination with ACM
- EKS managed node groups vs self-managed nodes — when to use which
- Kubernetes RBAC — ServiceAccounts, ClusterRoles, RoleBindings, least privilege
- aws-auth ConfigMap and RBAC for cluster access control
- EKS managed add-on vs self-managed — upgrade strategies
Project: EKS cluster up with eksctl, AWS Load Balancer Controller and ExternalDNS deployed via Helm, RBAC hardened, custom domain with SSL termination working
Week 14: 3-Tier App on EKS + StatefulSets + Storage
Running 3-Tier App on EKS + AWS Integrations
- Running 3-tier app — frontend + backend + RDS PostgreSQL on EKS
- Database migrations using Kubernetes Jobs
- Init containers for DB connection readiness checks
- IRSA in practice — backend pod accessing Secrets Manager without credentials
- AWS Secrets Manager integration — External Secrets Operator pattern
- Ingress rules for routing traffic to frontend vs backend
- Health checks at load balancer level vs pod level
- Blue-Green deployment on EKS with weighted routing
- Namespace strategy for multi-tier apps
- Real troubleshooting — ImagePullBackOff, pending pods, service not reachable
Project: Full 3-tier e-commerce app on EKS — frontend + Node.js backend + RDS PostgreSQL, IRSA for Secrets Manager, DB migration Job, custom domain, SSL, live troubleshooting of staged failures
StatefulSets, Persistent Storage + Image Optimisation
- StatefulSets deep dive — production patterns and failure recovery
- PersistentVolume, PVC, StorageClass — static vs dynamic provisioning on EKS
- EBS vs EFS — choosing the right storage for the workload
- Headless Services for StatefulSet DNS resolution
- Troubleshooting multi-attach volume errors and common StatefulSet failures
- Volume snapshots and backup strategies on EKS
- Multi-stage Docker builds — drastically smaller production images
- Distroless and minimal base images for attack surface reduction
- Trivy — container image vulnerability scanning in CI pipeline
- Docker image optimisation — layer caching, build context, .dockerignore
Project: Add MinIO as a StatefulSet with persistent EBS storage to the e-commerce app for product image uploads. Rebuild all images with multi-stage builds, integrate Trivy in GitHub Actions, reduce image sizes by 60%+
Week 15: Terraform EKS + Microservices + Advanced GitOps
Production EKS with Terraform
- Production EKS cluster with Terraform — VPC, subnets, node groups, add-ons
- Terraform module structure for EKS — separation of concerns
- Managing dev/staging/prod with Terraform workspaces
- Deploying AWS Load Balancer Controller and ExternalDNS via Terraform
- IRSA setup via Terraform — no manual console steps
- Terraform state management for EKS — remote backend, locking
- Importing existing EKS resources into Terraform state
- Terraform drift detection on EKS infrastructure
- Node group configuration — instance types, spot vs on-demand, taints and tolerations
- EKS upgrade strategy with Terraform — node group rotation
Project: Rebuild the entire EKS cluster from scratch with Terraform — VPC, node groups, add-ons, IRSA, Load Balancer Controller, ExternalDNS all provisioned via code. Zero manual console steps
Microservices on EKS + Advanced ArgoCD GitOps
- Microservices design principles — bounded context, single responsibility
- Splitting monolith into microservices — frontend, order, inventory, user service
- Inter-service communication — ClusterIP vs headless vs service mesh
- Network Policies for microservice traffic isolation between namespaces
- Gateway API — advanced ingress routing vs traditional Ingress
- ArgoCD App-of-Apps pattern — managing many services cleanly
- ArgoCD ApplicationSet for environment promotion across dev/staging/prod
- Helm chart per microservice — templating, values per environment
- Matrix builds in GitHub Actions for multiple microservices
- Reusable GitHub Actions workflows with composite actions
- Kubecost/OpenCost — namespace-level cost attribution per service
Project: E-commerce app split into 4 microservices each with own Helm chart and ArgoCD Application, App-of-Apps managing all services, Gateway API routing, matrix CI/CD builds, OpenCost showing per-service spend
Week 16: Full Observability Stack
Metrics, Logs + Dashboards
- How observability works in real production companies
- Prometheus — metrics collection, PromQL, scrape configs
- Prometheus Operator and ServiceMonitor CRDs
- Loki for log storage and querying — LogQL basics
- Fluent Bit on EKS — log aggregation, filtering, routing to Loki
- Grafana dashboards — Kubernetes cluster, app metrics, AWS resource metrics
- AlertManager — routing alerts to Slack and PagerDuty with grouping and silencing
- CloudWatch Container Insights integration alongside Prometheus
- Monitoring differences — Fargate vs managed node groups
- Cost visibility dashboard — RDS, Lambda, EKS node costs in Grafana
Project: Prometheus + Loki + Grafana + Fluent Bit deployed on EKS, Grafana dashboard showing order volume, error rates, DB query latency, AlertManager fires Slack alert when order service error rate crosses 1%, CloudWatch Container Insights alongside
Distributed Tracing, SLOs + Advanced Alerting
- OpenTelemetry for distributed tracing — instrumentation, collectors, exporters
- Tracing a request across frontend → order service → inventory service → DB
- Jaeger or Tempo as tracing backend — setup and querying
- SLO and SLI definitions — what they mean in practice
- Error budget dashboards in Grafana — how teams use them day to day
- Multi-window, multi-burn-rate alerting for SLOs
- AlertManager advanced — inhibition rules, routing trees, deduplication
- Runbook links in alerts — connecting alert to action
- Log-based alerting in Grafana with Loki rules
- Observability for stateful services — what’s different about monitoring databases
- Live debugging 10 real Kubernetes interview scenarios — staged failures on the e-commerce cluster
Project: OpenTelemetry tracing across all e-commerce microservices, Tempo as backend, Grafana trace explorer showing end-to-end request flow, SLO dashboards with error budget burn rate, multi-window AlertManager rules for order service
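The burn-rate idea behind multi-window alerting fits in a few lines. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and a page fires only when both a long and a short window exceed the threshold. The sketch below assumes a 99.9% SLO and uses 14.4 as the threshold, a value commonly paired with 1-hour and 5-minute windows; both are illustrative choices, not course requirements.

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    return error_ratio / budget

def should_page(long_window_errors, short_window_errors, slo_target, threshold=14.4):
    """Page only if both windows burn fast: the long window shows the problem is
    significant, the short window shows it is still happening."""
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)

SLO = 0.999  # assumed target for the order service
print(round(burn_rate(0.02, SLO), 1))                  # 2% errors against a 0.1% budget
print(should_page(0.02, 0.02, SLO))                    # still failing now
print(should_page(0.02, 0.001, SLO))                   # already recovered
```

The same two-window condition is what an AlertManager rule would express in PromQL; the Python version is only meant to make the arithmetic visible.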
Week 17: Service Mesh, Karpenter + Security
Service Mesh + Network Policies + Zero Trust
- Service mesh fundamentals — why it exists, what problems it actually solves
- Istio installation and architecture — control plane, data plane, sidecars
- mTLS between all microservices — automatic, no code changes
- Traffic management — VirtualService, DestinationRule, Gateway
- Canary deployments with Istio traffic splitting — 10% to new version
- Visualising service mesh traffic with Kiali
- Network Policies for zero-trust pod-to-pod communication
- Egress controls and namespace isolation
- Pod topology spread constraints for multi-AZ resilience
- Istio observability — built-in metrics, tracing integration with Jaeger
Project: Istio deployed on e-commerce EKS cluster, mTLS enforced between all microservices, canary deployment for order service routing 10% traffic to v2, Kiali showing live traffic topology, network policies blocking all non-essential pod communication
Karpenter, EKS Auto Mode + Cost Optimisation
- Karpenter architecture — how it differs from Cluster Autoscaler
- NodePool and EC2NodeClass configuration
- Cost optimisation with Spot + On-Demand mixed node fleets
- Karpenter bin packing and consolidation policies — removing underutilised nodes
- Taints, tolerations, and node selectors with Karpenter
- EKS Auto Mode — what it is, when to use it over Karpenter
- Pod topology spread constraints across AZs with Karpenter
- Kyverno policy enforcement — blocking deployments without resource limits
- OPA Gatekeeper vs Kyverno — when to use which
- Pod security admission — restricted, baseline, privileged modes
- Security contexts and pod security standards in practice
Project: Karpenter deployed on EKS, replacing managed node group for inventory service, Spot instances with on-demand fallback, Kyverno policies enforced, blocking any deployment without resource limits and liveness probes, pod security standards applied cluster-wide
Module 6 — DevSecOps
Week 18: DevSecOps – Shift Left + Runtime Security + Policy as Code
- DevSecOps on Kubernetes — shifting security left in the pipeline
- Security integrated into the pipeline — SAST with Semgrep, DAST with OWASP ZAP against staging, SCA and dependency auditing with Trivy and Grype, pre-commit secret scanning and GitHub secret scanning.
- Container supply chain security — Trivy blocking critical CVEs before image push, image signing with Cosign, SBOM generation with Syft.
- Falco for runtime threat detection on EKS — custom rules for suspicious process execution, file access, and network activity.
- Kyverno policy engine — blocking unsigned images, requiring resource limits, enforcing labels and security contexts.
- Pod Security Admission — restricted, baseline, and privileged modes. CIS Kubernetes Benchmark scanning with kube-bench and remediation. Network policies for zero-trust between namespaces.
- Project: Complete DevSecOps pipeline — Trivy image scan blocks critical CVEs, Checkov lints manifests before apply, Falco fires alerts on suspicious runtime activity, Cosign signs all images, Kyverno rejects unsigned images. All enforced in GitHub Actions before anything reaches EKS
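The "block on critical CVEs" gate can be illustrated with a short Python sketch that reads a Trivy JSON report and collects CRITICAL findings. It assumes the report layout Trivy produces with `--format json` (a `Results` list whose entries may carry a `Vulnerabilities` list); the sample report and CVE identifiers are placeholders, and a pipeline can often get the same effect from Trivy's own severity and exit-code options.

```python
import json


def critical_findings(report_json: str) -> list[str]:
    """Return the IDs of all CRITICAL vulnerabilities in a Trivy JSON report."""
    report = json.loads(report_json)
    found = []
    # "Results" or "Vulnerabilities" may be missing or null for clean targets.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") == "CRITICAL":
                found.append(vuln.get("VulnerabilityID", "unknown"))
    return found


if __name__ == "__main__":
    # Placeholder report; the IDs below are not real CVEs.
    sample = json.dumps({"Results": [{
        "Target": "app-image",
        "Vulnerabilities": [
            {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
            {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"},
        ],
    }]})
    blocked = critical_findings(sample)
    print("blocked by:", blocked)
    # A CI step would exit non-zero here so the image is never pushed.
    raise SystemExit(1 if blocked else 0)
```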
Module 7 — Site Reliability Engineering (SRE)
Week 19: SRE Principles + War Rooms + Chaos Engineering
- SLI, SLO, SLA: precise definitions, how to write measurable objectives, and how to use error budgets to make deployment decisions rather than relying on gut feel.
- DORA metrics — deployment frequency, lead time for changes, MTTR, and change failure rate as a team health indicator.
- On-call culture, runbook writing discipline, escalation path design, and handoff practice. Postmortem culture — blameless analysis, timeline reconstruction, contributing factors, and writing RCAs that prevent recurrence.
- Chaos engineering with LitmusChaos — pod failure, network delay, CPU stress, and disk fill experiments with defined steady-state hypotheses and blast radius limits.
- Three live war room simulations — OOMKill cascade, DB connection pool exhaustion, and Karpenter provisioning failure — each followed by a written RCA.
Advanced Live Troubleshooting + Kubernetes Interview Scenarios
- Common production failure patterns — node pressure, evictions, DNS failures, RBAC misconfiguration, webhook timeouts
- Debugging networking — pod-to-pod, pod-to-service, ingress, CNI issues
- etcd health checks and control plane debugging on EKS
- Interpreting kubectl describe, events, and pod logs together
- Kubernetes system design questions: walk through 5 real scenarios
- How to think through and answer K8s design questions in interviews
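The four DORA metrics covered this week reduce to simple arithmetic over deployment records. The Python sketch below assumes a hypothetical `Deployment` record with commit, deploy, and restore timestamps; real numbers would come from your CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed change is fixed


def hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def dora_summary(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four DORA metrics for one reporting period."""
    failures = [d for d in deployments if d.failed]
    lead = [hours(d.committed_at, d.deployed_at) for d in deployments]
    restore = [hours(d.deployed_at, d.restored_at) for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "mean_lead_time_hours": sum(lead) / len(lead) if lead else 0.0,
        "change_failure_rate": len(failures) / len(deployments) if deployments else 0.0,
        "mean_time_to_restore_hours": sum(restore) / len(restore) if restore else 0.0,
    }


if __name__ == "__main__":
    day = datetime(2026, 1, 1)
    records = [
        Deployment(day.replace(hour=9), day.replace(hour=10)),
        Deployment(day.replace(hour=11), day.replace(hour=14),
                   failed=True, restored_at=day.replace(hour=16)),
    ]
    print(dora_summary(records, period_days=1))
```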
Module: MLOPS
Week 20: ML Foundations for Platform Engineers + Pipelines
- MLops from a platform engineer’s perspective — what carries over from DevOps and what is genuinely new.
- MLflow for experiment tracking, model registry, artifact storage on S3, and model promotion through Staging to Production.
- DVC for data version control — treating datasets like code for reproducible training runs.
- Kubeflow Pipelines on EKS — DAG-based workflows, parameterised pipelines, cached steps, and artifact tracking.
- Argo Workflows as an alternative for ML — DAG vs steps mode, when to prefer Argo over Kubeflow.
- GPU node configuration on EKS — NVIDIA device plugin, node taints and tolerations.
- Spot instances for training jobs via Karpenter NodePool with on-demand fallback.
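The taints-and-tolerations mechanism that keeps ordinary pods off GPU nodes can be modelled in a few lines. This is a simplified Python sketch of the scheduler's matching rules, not the real scheduler code; the `nvidia.com/gpu` taint key is a common convention used here as an assumption.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """True if one toleration matches one taint (simplified Kubernetes rules)."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    key = toleration.get("key")
    if toleration.get("operator", "Equal") == "Exists":
        # Exists with no key tolerates every taint.
        return key is None or key == taint["key"]
    return key == taint["key"] and toleration.get("value") == taint.get("value")


def schedulable(pod_tolerations: list[dict], node_taints: list[dict]) -> bool:
    """A pod fits only if every NoSchedule/NoExecute taint is tolerated."""
    blocking = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations)
               for taint in blocking)


if __name__ == "__main__":
    gpu_node = [{"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}]
    training_pod = [{"key": "nvidia.com/gpu", "operator": "Exists",
                     "effect": "NoSchedule"}]
    print(schedulable(training_pod, gpu_node))  # training job may land here
    print(schedulable([], gpu_node))            # ordinary pod is kept off
```

A toleration only permits scheduling on the tainted node; steering the training job onto GPU nodes still needs a node selector or a GPU resource request.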
Week 21: Model Serving + CI/CD for Machine Learning
- Model serving patterns — batch inference, real-time REST, async queue-based, and streaming.
- BentoML for packaging models as containerised services and deploying on EKS.
- SageMaker real-time endpoints vs self-hosted EKS serving — latency, cost, and operational control trade-offs.
- KEDA for inference autoscaling on SQS queue depth.
- Canary model rollout using Istio traffic splitting — 10% to new model, promote on metric threshold.
- Champion vs challenger pattern in CI/CD — automated comparison, promote only if challenger beats champion on accuracy and latency.
- OIDC for keyless ML pipeline auth to AWS from GitHub Actions (same pattern as Module 3).
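The champion vs challenger gate described above comes down to one comparison that a CI job can run before promotion. The Python sketch below is one possible shape for it; the metric names, the p95 latency choice, and the tolerance parameters are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelMetrics:
    accuracy: float
    p95_latency_ms: float


def should_promote(champion: ModelMetrics, challenger: ModelMetrics,
                   min_accuracy_gain: float = 0.0,
                   max_latency_regression_ms: float = 0.0) -> bool:
    """Promote only if the challenger is more accurate and not slower."""
    accuracy_ok = challenger.accuracy - champion.accuracy > min_accuracy_gain
    latency_ok = (challenger.p95_latency_ms - champion.p95_latency_ms
                  <= max_latency_regression_ms)
    return accuracy_ok and latency_ok


if __name__ == "__main__":
    champion = ModelMetrics(accuracy=0.91, p95_latency_ms=45.0)
    challenger = ModelMetrics(accuracy=0.93, p95_latency_ms=40.0)
    # A pipeline step would fail the job when this prints False.
    print(should_promote(champion, challenger))
```

Both metrics should be measured on the same held-out dataset and hardware, otherwise the comparison says little.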
Week 22: ML Observability + Drift Detection + MLops in Production
- Data drift and concept drift — what they are, why they silently break models, and how to detect each.
- Evidently AI as a sidecar to the inference service for data quality reports and drift scoring.
- Prometheus custom metrics from inference — prediction confidence, volume, and latency histograms.
- Grafana dashboard showing prediction distribution shift over time and drift score trends.
- AlertManager firing on drift threshold breach and triggering automated Kubeflow retraining — full closed loop from detection to redeployment.
- Shadow mode deployment for validating a new model against production traffic before any live exposure.
- Ground truth collection patterns and delayed label feedback loops in real production ML systems.
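A drift score of the kind plotted and alerted on this week can be approximated with the Population Stability Index (PSI). The sketch below is plain Python and is not Evidently's API; the bin count and the 0.2 alert threshold are common rules of thumb used here as assumptions.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # degenerate baseline: everything falls in the first bin

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


if __name__ == "__main__":
    baseline = [i / 100 for i in range(100)]   # training-time feature values
    live = [v + 0.5 for v in baseline]         # same feature, shifted upwards
    score = psi(baseline, live)
    print(f"PSI = {score:.2f}", "DRIFT" if score > 0.2 else "stable")
```

Exposed as a Prometheus gauge per feature, a score like this is what an AlertManager rule would compare against the drift threshold.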
Module: AIops
Week 23: LLM Infrastructure on AWS + RAG for Operations
- LLM deployment options on AWS — Bedrock for managed models (Claude, Llama, Titan), SageMaker JumpStart, and self-hosted open-source models on EKS with vLLM.
- Embedding models and vector databases — pgvector on RDS, OpenSearch vector engine — chunking strategies for technical documents. Building a RAG system that indexes all runbooks, postmortems, and architecture decision records from the project.
- FastAPI inference service on EKS accepts incident descriptions, retrieves the top relevant documents, sends to Bedrock, and returns a structured response with probable cause, suggested commands, and runbook link.
- Continuous re-ingestion is triggered by a Git webhook when runbooks are updated. LLM security — prompt injection risks in operations contexts, guardrails, and input validation.
- Running a local LLM and a sandboxed OpenCLAW to automate simple tasks
- Building custom MCP servers to create cost reports and to create and delete AWS resources from simple prompts
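The retrieval step of the RAG system (find the most relevant runbooks for an incident) can be sketched without any vector database. The toy Python example below ranks documents by cosine similarity; the document IDs and three-dimensional vectors are made up, and in practice the vectors would come from an embedding model and be stored in pgvector or OpenSearch.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the k document IDs whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda doc_id: cosine(query, index[doc_id]),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical embeddings for three indexed documents.
    index = {
        "runbook-oomkill": [1.0, 0.0, 0.0],
        "postmortem-db-pool": [0.0, 1.0, 0.0],
        "adr-karpenter": [0.9, 0.1, 0.0],
    }
    incident_embedding = [1.0, 0.0, 0.0]
    print(top_k(incident_embedding, index, k=2))
```

The returned documents are what the inference service would place into the prompt before calling the model.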
Week 24: Agentic AIops + Self-Healing Infrastructure
- Agentic AI architecture for operations: ReAct pattern, tool calling, and decision loops with constrained action spaces.
- LLM-powered incident triage: an AlertManager webhook triggers a Lambda agent that queries Loki for recent logs, queries Prometheus for metrics, retrieves relevant runbook sections from the RAG system, and posts a structured Slack message with severity, probable cause, and the first three recommended actions in under 60 seconds.
- Self-healing infrastructure patterns: detect, decide, act, and verify. Automated remediation with guardrails: blast radius limits, dry-run mode, rollback triggers, and full audit logging of every automated action.
- EventBridge plus Lambda plus Bedrock for event-driven self-healing. LLM-driven cost optimisation: analysing Cost Explorer output and generating rightsizing recommendations.
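The guardrails listed for automated remediation (blast radius limit, dry-run mode, audit logging) fit into a small wrapper around whatever actually performs the action. The Python sketch below is one way to structure it; the `Remediator` class, its defaults, and the `restart-pod` action name are hypothetical, and rollback triggers are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Remediator:
    max_targets: int = 1     # blast radius limit per action
    dry_run: bool = True     # safe default: log intent, change nothing
    audit_log: list = field(default_factory=list)

    def run(self, action: str, targets: list[str],
            execute: Callable[[str, str], None]) -> str:
        """Apply the guardrails, then call execute(action, target) per target."""
        if len(targets) > self.max_targets:
            outcome = "rejected"   # too many resources touched at once
        elif self.dry_run:
            outcome = "dry-run"
        else:
            for target in targets:
                execute(action, target)
            outcome = "executed"
        # Every decision is recorded, including rejections and dry runs.
        self.audit_log.append(
            {"action": action, "targets": list(targets), "outcome": outcome})
        return outcome


if __name__ == "__main__":
    remediator = Remediator(max_targets=2, dry_run=False)
    result = remediator.run("restart-pod", ["orders-7c9f"],
                            lambda action, target: print(action, target))
    print(result, remediator.audit_log)
```

Keeping the decision logic separate from `execute` means the same guardrails apply whether the action comes from a Lambda handler or an LLM agent's tool call.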
Module: Getting ready for the job market
Week 25: Resume + Portfolio + Interview Preparation
Resume, Portfolio + Final Interview Prep
- Recorded mock interviews: full mock interview sessions covering DevOps, Kubernetes, MLops, and system design questions, shared with all students.
- Group sessions: live group interview practice where students interview each other with feedback from Akhilesh.
- Resume framework: action verbs, metric-driven project impact statements, and how to present bootcamp work as production experience to a hiring manager.
- System design interview: walking through 5 real-world DevOps and MLops design questions with a structured thinking methodology. How senior engineers think about trade-offs in interviews: cost vs reliability vs complexity framing.
- Kubernetes scenario-based questions: the 10 most common live debugging scenarios asked in senior interviews. What hiring managers actually look for in a DevOps, SRE, or MLops engineer in 2026.
Bonus Module
Personal career help, Guidance with Akhilesh Mishra
- 1:1 calls (resume reviews + mock interview) with Akhilesh Mishra
- LinkedIn and resume review by Akhilesh — personal feedback, not automated.
- Referrals to opportunities in the LivingDevOps network where there is a genuine fit.
- GitHub portfolio cleanup — README structure, architecture diagrams, ADRs, and decision logs that show engineering judgment, not just code.
What you get out of this Bootcamp
- 25 weeks of live instruction · 50 classes, 3 hours each
- Pre-bootcamp Linux + AWS recorded sessions sent on enrolment
- 20+ real-world production-grade projects
- Lifetime access to all recordings, code, notes, and resources
- Private GitHub organisation — all cohort projects with real PR-based collaboration
- Recorded mock interview sessions — shared with all students
- Live group interview practice sessions
- LivingDevOps Discord — cohort community + alumni network
- Architecture decision records for every major technology choice in the curriculum
- FinOps cost analysis thread — real AWS cost implications shown every week
- LinkedIn profile review and resume feedback from Akhilesh personally
- Referrals to relevant opportunities in the network where there is a genuine fit
- Certificate of completion — DevOps + MLops + AIops
This curriculum follows a logical, incremental learning path from Linux and AWS fundamentals through Kubernetes, DevSecOps, SRE, MLops, and AIops, so each concept builds on the previous one.
Reach out for Queries and Part Payment Requests
- Email: livingdevops@gmail.com
- WhatsApp: +91 9259681620
Testimonials
Sandeep
Hemant kumar
Ameet Khemani
Avinash V
Varsha Gore
Sajitha
Gaurav Dubey
Kajal
