25-Week AWS DevOps + MLOps + AIOps Bootcamp with Real-World Projects

Details

Advanced DevOps Bootcamp with AWS, ECS, Python, Kubernetes, MLOps, AIOps, DevSecOps & SRE (Real-World Projects)

  • By the end of this bootcamp, you will be able to design, deploy, and operate production-grade infrastructure on AWS: containerised applications on EKS, ML model pipelines with drift detection, AI-powered incident triage, and more.
  • The bootcamp focuses on building real-world projects with production-level detail and live troubleshooting.
  • Every week is built around how senior DevOps and platform engineers actually work.
  • Whether you are transitioning into DevOps or are a senior engineer moving into cloud-native, MLOps, and AIOps, this bootcamp gives you the production experience that recruiters are asking for in 2026.

Pre-requisites: Basic Linux and AWS knowledge (I will provide you with resources/past recordings to go through before the class)

Bootcamp structure

Module 1: (2-week) Getting comfortable with AWS


This module is about getting comfortable with my teaching style while learning the basics of AWS

Class 1

  • Creating a custom VPC and setting up full networking (private and public access)
  • Setting up a NAT Gateway for outbound-only access
  • Understanding how people use AWS in production
  • Using IAM roles, inline/custom policies
  • Bastion host pattern — why you never SSH directly to private instances
  • IAM Instance Profile for S3 access, SSH only via Bastion
  • VPC Endpoints — S3 Gateway endpoint, why it saves cost and improves security
  • EC2 login without an open SSH port or a public IP (with AWS Systems Manager Session Manager)
  • VPC Peering vs Transit Gateway — connecting VPCs

Project: Production VPC from Scratch

  • 6 subnets (2 public, 2 private app, 2 private DB) across 2 AZs
  • NAT Gateway, S3 Gateway endpoint, route tables for each tier
  • Security Groups per tier with least-privilege rules, NACLs on DB subnets
  • Bastion host: SSH to private instance through it, direct access blocked
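The six-subnet layout in this project can be planned on paper before anything is created in AWS. The sketch below uses Python's standard ipaddress module to carve one subnet per tier and AZ out of a VPC CIDR. The 10.0.0.0/16 range, the /24 subnet size, the tier names, and the AZ names are all assumptions chosen for illustration, not values from the class.

```python
import ipaddress


def plan_subnets(vpc_cidr, tiers, azs, new_prefix=24):
    """Assign one subnet per (tier, AZ) pair, carved in order from the VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # iterator over consecutive blocks
    return [
        {"tier": tier, "az": az, "cidr": str(next(blocks))}
        for tier in tiers
        for az in azs
    ]


# Assumed values for illustration; swap in your own CIDR and AZs.
PLAN = plan_subnets(
    "10.0.0.0/16",
    ["public", "private-app", "private-db"],
    ["ap-south-1a", "ap-south-1b"],
)

if __name__ == "__main__":
    for subnet in PLAN:
        print(f'{subnet["tier"]:<12} {subnet["az"]:<12} {subnet["cidr"]}')
```

Keeping the plan as data makes it easy to feed the same layout into the console walkthrough now and into Terraform variables later.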

Class 2: Storage

  • Discussing how different storage types are used in the real world
  • S3 storage classes, lifecycle policies, versioning, bucket policies
  • Static website hosting, pre-signed URLs, S3 event notifications
  • CloudFront — CDN, Origin Access Control, cache invalidation
  • S3 replication — SRR vs CRR, IAM roles, delete marker replication

Project 1: Static App on S3 + CloudFront + Custom Domain + SSL

  • S3 with OAC, CloudFront distribution, ACM cert in us-east-1, Route 53 alias record

Project 2: Real-Time Cross-Region Replication

  • Source bucket ap-south-1 → destination us-east-1, versioning on both, lifecycle policy on destination

Class 1: Running a PostgreSQL-backed 2-tier app on EC2

  • Install PostgreSQL on EC2
  • Running a 2-tier app on EC2 with a production-grade setup
  • Elastic IP, Route 53 hosted zones, A records, alias records
  • Let’s Encrypt free SSL cert: DNS validation, attaching to resources
  • nginx as a reverse proxy with the Let’s Encrypt cert, HTTPS end-to-end
  • Using AI copilots for faster workflow and troubleshooting

Project: Running a Flask app on EC2 + nginx reverse proxy + Elastic IP + DNS + Free SSL

Class 2: Auto Scaling groups and load balancers to run an RDS-backed 2-tier app

  • ALB vs NLB: when to use which, listeners, rules, target groups, health checks
  • ALB path-based routing, HTTPS termination, HTTP → HTTPS redirect
  • Launch Templates, ASG: desired/min/max, AZ balancing
  • Scaling policies: target tracking, step scaling, scheduled
  • User data in ASG: pulling config from SSM on launch
  • Connection draining, instance refresh for zero-downtime AMI updates

Project: 2-Tier App with ALB + ASG + RDS + Route 53 + ACM

  • Launch Template with user data, ASG across 2 private subnets, min 2 max 6
  • ALB with HTTPS, health checks, target tracking on CPU
  • RDS PostgreSQL in private DB subnets, credentials in SSM Parameter Store
  • Simulate: terminate instance, scale under load, verify zero downtime



Module 2: (4-week) Running Containers in Production + Terraform + GitHub Actions CI/CD

Class 1: Docker + running a two-tier App on ECS (Console)

  • Docker fundamentals and how it works in production
  • Building, sharing, and running custom Docker images
  • Container networking, volume management with meaningful use cases
  • Docker Compose for multi-container applications, app dependencies, and health checks
  • Migrating the app running on a VM to containers
  • Migrating a Dockerised app running on EC2 to ECS, planning, design and execution
  • ECS fundamentals — clusters, services, task definitions, Fargate vs EC2 launch type
  • Task definition anatomy — container definitions, CPU/memory, environment variables, secrets
  • ECS service — desired count, deployment config, circuit breaker
  • ALB integration with ECS — target group, dynamic port mapping, health checks
  • ECS IAM — task role vs task execution role, what each one does
  • Secrets Manager + ECS — injecting secrets into containers without hardcoding
  • VM App Migration to AWS ECS: Migrating a containerised app running on EC2 to ECS Fargate — move secrets to Secrets Manager, upload files to S3, and cut over traffic from EC2 to ECS using Route 53 weighted routing with zero downtime.
  • ECS Fargate cluster, task definition for Flask app connecting to RDS PostgreSQL
  • ALB with HTTPS, health check on /health, CloudWatch log group per container
  • Secrets Manager for DB credentials injected at runtime, task role for S3 access

Project: Deploy an RDS-Backed 2-Tier App on ECS


Class 3: Full ECS Infrastructure with Terraform

  • Terraform architecture — providers, resources, state, plan, apply lifecycle
  • Writing your first resources — VPC, EC2, S3 in Terraform
  • Variables, locals, outputs — making configs reusable
  • Remote state — S3 backend, why local state breaks teams
  • terraform.tfvars and environment-specific variable files
  • terraform import — bringing existing resources under Terraform management
  • Common mistakes — hardcoded values, missing state locking, giant single files
  • Terraform module structure for ECS — VPC, ALB, ECS, RDS as separate modules
  • ECS service and task definition in Terraform — every config option explained
  • ALB in Terraform — listeners, rules, target groups, ACM cert attachment
  • RDS in Terraform — subnet groups, parameter groups, Multi-AZ toggle per environment
  • Terraform for_each and count — creating multiple similar resources cleanly
  • depends_on and resource ordering — avoiding race conditions on apply
  • Terraform module versioning — pinning modules for stability

Project: Dev and Production ECS Environments with Terraform

  • Single module codebase, two workspaces — dev (single AZ, smaller instances) and prod (Multi-AZ, larger)
  • Full stack — VPC, ALB, ECS Fargate, RDS, ACM, CloudWatch — all in Terraform
  • terraform plan output reviewed and applied cleanly for both environments

Class 4: Git, GitHub + CI/CD Fundamentals

  • Git fundamentals — commits, branches, merge vs rebase, resolving conflicts
  • Branching strategies — Gitflow vs trunk-based, what real teams actually use
  • GitHub Actions architecture — workflows, jobs, steps, runners
  • Triggers — push, pull_request, workflow_dispatch, schedule
  • Matrix builds: testing across Node 18 and Node 20 in parallel
  • GitHub Actions authentication with AWS
  • Secrets and environment variables in GitHub Actions — repo secrets vs environment secrets
  • GitHub Environments — approval gates before deploying to production
  • SDLC and Jira — how tickets flow from backlog to deployed feature in real teams

Project: Multi-Stage CI Pipeline with Automated Testing

  • GitHub Actions pipeline — lint → unit test → build Docker image → push to ECR
  • Pull request check — pipeline must pass before merge is allowed

Class 5: Automated ECS Deployments with GitHub Actions

  • Container image versioning — git SHA tagging, semantic versioning, latest anti-pattern
  • ECR lifecycle policies — cleaning up old images automatically
  • OIDC + Keyless Authentication + Security Hardening
  • ECS deployment strategies — rolling update, blue-green via CodeDeploy
  • GitHub Actions ECS deploy — aws-actions/amazon-ecs-deploy-task-definition
  • Environment-specific workflows — dev deploys on merge to main, prod requires approval
  • Rollback strategy — redeploying the previous task definition on failure
  • Deployment notifications — Slack alerts on success and failure
  • Testing in CI — running integration tests against a staging ECS environment before prod

Project: Automated Build and Deployment Pipeline for ECS

  • GitHub Actions builds an image on every merge, tags with the git SHA, and pushes to ECR
  • Automatic rollback if health checks fail within 5 minutes
  • Email notification on deployment success, failure, and rollback

Class 6: ECS Auto Scaling + Load Testing + Monitoring

  • ECS service auto scaling — Application Auto Scaling, target tracking on CPU and ALB request count
  • ECS task-level scaling vs service-level scaling — understanding the difference
  • CloudWatch custom metrics — pushing app-level metrics from containers
  • CloudWatch dashboards — ECS service health, ALB latency, RDS connections in one view
  • CloudWatch alarms — composite alarms, alarm actions (SNS → email/Slack)
  • AWS X-Ray for distributed tracing on ECS — enabling, reading traces
  • Load testing with k6 or hey — simulating real traffic, finding bottlenecks
  • Reading CloudWatch Container Insights under load — what to look for

Project: Load Test + Auto Scaling + Monitoring Dashboard

  • Target tracking policy — scale ECS tasks when ALB request count per target exceeds threshold
  • Run k6 load test, watch tasks scale out, verify ALB distributes traffic
  • CloudWatch dashboard showing ECS CPU, ALB 5xx rate, RDS connection count, p99 latency
  • Composite alarm — fires Slack alert when both high CPU and high error rate occur together

Class 7: Three-Tier App with Advanced Terraform

  • Advanced Terraform modules — public registry vs private, module composition patterns
  • Data sources — referencing existing resources without hardcoding ARNs
  • Multi-environment strategy — dev, prod with shared modules and separate state files
  • Importing existing EKS resources into Terraform state
  • CloudFront in Terraform — distribution, origins, cache behaviours, OAC for S3
  • RDS in Terraform — automated backups, snapshot retention, Multi-AZ for prod
  • RDS Disaster recovery, cross-region Read Replica, automated snapshot copy
  • Terraform lifecycle blocks — prevent_destroy, create_before_destroy, ignore_changes
  • Terraform drift detection — terraform plan in CI to catch manual changes

Project: Three-Tier App with DR Strategy in Terraform

  • Full three-tier stack — CloudFront → ALB → ECS → RDS across dev, staging, prod
  • RDS Multi-AZ — synchronous replication, automatic failover
  • Read Replicas — async replication, read scaling, cross-region DR
  • Multi-AZ vs Read Replica — the difference most engineers get wrong
  • RDS snapshots, point-in-time recovery, Aurora basics
  • DR patterns: Backup & Restore, Pilot Light, Warm Standby, Multi-Site Active-Active
  • RDS Proxy for connection pooling

Class 8: OIDC for GitHub Actions + Reusable GitHub Actions Workflows

  • Why access keys in CI/CD are dangerous — rotation burden, leak risk, audit gaps
  • OIDC fundamentals — how GitHub proves its identity to AWS without a password
  • Setting up OIDC provider in AWS IAM — thumbprint, audience, provider URL
  • Keyless Terraform in CI — aws-actions/configure-aws-credentials with OIDC
  • Fine-grained OIDC conditions — locking roles to specific repos, branches, environments
  • Full end-to-end deploy — code push → OIDC auth → Terraform plan → ECS deploy — zero static credentials anywhere

Project:

  • Writing reusable workflows — workflow_call, composite actions
  • Setting up OIDC to avoid long-lived credentials in CI/CD (GitHub Actions) workflows

Module 3: (3-week) Python for DevOps


Class 1: Python for DevOps + Boto3 Deep Dive

  • Python environment setup, virtual environments, and project structure
  • Data structures DevOps engineers actually use — dicts, lists, sets for parsing API responses
  • os and subprocess modules — running shell commands, reading system state from Python
  • File operations: reading, writing, and walking directories
  • Error handling and exception management
  • Creating reusable Python modules and a custom CLI
  • Working with JSON and YAML — parsing, validating, transforming config and API responses
  • Python logging best practices — levels, formatters, rotating file handlers

Project: AWS Resource Audit CLI

Build a production-grade Python CLI that connects to AWS via boto3, paginates through EC2, S3, and RDS resources across regions, generates a formatted report of all running resources with costs, and logs every operation. Runs on a schedule or on demand.
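The core of the audit is walking every page of an API response and keeping only what is running. A minimal sketch of that step for EC2 follows. It works on plain dictionaries shaped like `describe_instances` response pages, so it runs without AWS credentials. The function name and the sample pages are hypothetical, made up for illustration.

```python
def running_instances(pages):
    """Yield (instance_id, instance_type) for running instances across all pages."""
    for page in pages:
        for reservation in page.get("Reservations", []):
            for inst in reservation.get("Instances", []):
                if inst["State"]["Name"] == "running":
                    yield inst["InstanceId"], inst["InstanceType"]


# Hand-written pages standing in for paginator output (illustrative data only).
FAKE_PAGES = [
    {"Reservations": [{"Instances": [
        {"InstanceId": "i-aaa", "InstanceType": "t3.micro", "State": {"Name": "running"}},
        {"InstanceId": "i-bbb", "InstanceType": "t3.micro", "State": {"Name": "stopped"}},
    ]}]},
    {"Reservations": [{"Instances": [
        {"InstanceId": "i-ccc", "InstanceType": "t3.small", "State": {"Name": "running"}},
    ]}]},
]

if __name__ == "__main__":
    for instance_id, instance_type in running_instances(FAKE_PAGES):
        print(instance_id, instance_type)
```

With boto3 installed and credentials configured, the same function should accept `boto3.client("ec2").get_paginator("describe_instances").paginate()` in place of `FAKE_PAGES`.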


Class 2: Working with APIs + CRUD Operations

  • Using the requests module to make API CRUD requests (Create, Read, Update, Delete)
  • Implementing proper error handling for API calls
  • Handling API authentication and headers
  • Parsing and validating API responses
  • boto3 architecture — sessions, clients vs resources, regions, profiles
  • Paginating through AWS APIs with get_paginator — why it matters at scale
  • Environment-based config management — .env, os.environ, secrets handling
  • Error handling and exception management for AWS API calls

Project: API CRUD Automation Script

Python script using the requests module to make full CRUD operations against a REST API: proper auth headers, error handling, response validation, and structured logging for every call.



Class 3: Lambda — Event Processing + Lambda layers

  • Lambda execution model — cold starts, warm starts, concurrency limits, what they cost
  • Function anatomy — handler, event object, context object
  • IAM roles and least-privilege security for Lambda
  • Triggers — S3, SQS, SNS, EventBridge cron and event-based
  • Environment variables and secrets management in Lambda
  • Lambda deployment — zip packaging, console vs CLI
  • CloudWatch Logs integration and structured logging from Lambda
  • Lambda Layers — packaging dependencies, sharing code across functions
  • Integration with SQS, SNS, and S3 for event processing
  • EventBridge for event routing and processing
  • Error handling between Lambda stages — retries, DLQ, alerting
  • Lambda concurrency management for high-throughput pipelines
  • Monitoring pipeline health with CloudWatch metrics and alarms
  • Dead letter queues for failed invocation handling

Project 1: IAM Key Rotation Lambda

Lambda function on an EventBridge schedule that scans all IAM users, identifies keys older than 90 days, rotates them, stores new keys in Secrets Manager, and sends an SES email report with key ages and rotation status.
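The "older than 90 days" check is the decision point of this Lambda, and it can be written and tested as a pure function before any AWS calls are wired in. The sketch below assumes key records shaped like the `AccessKeyMetadata` entries returned by IAM's `list_access_keys`. Skipping inactive keys is an assumption of this sketch, and the sample key IDs and dates are invented.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE_DAYS = 90  # threshold from the project description


def stale_keys(key_metadata, now=None, max_age_days=MAX_AGE_DAYS):
    """Return active keys whose CreateDate is more than max_age_days before now."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        key for key in key_metadata
        if key["Status"] == "Active" and key["CreateDate"] < cutoff
    ]


# Invented sample data for illustration.
NOW = datetime(2026, 1, 1, tzinfo=timezone.utc)
SAMPLE = [
    {"UserName": "alice", "AccessKeyId": "AKIAEXAMPLEOLD", "Status": "Active",
     "CreateDate": datetime(2025, 9, 1, tzinfo=timezone.utc)},
    {"UserName": "bob", "AccessKeyId": "AKIAEXAMPLENEW", "Status": "Active",
     "CreateDate": datetime(2025, 12, 1, tzinfo=timezone.utc)},
    {"UserName": "carol", "AccessKeyId": "AKIAEXAMPLEOFF", "Status": "Inactive",
     "CreateDate": datetime(2025, 1, 1, tzinfo=timezone.utc)},
]

if __name__ == "__main__":
    for key in stale_keys(SAMPLE, now=NOW):
        age = (NOW - key["CreateDate"]).days
        print(f'{key["UserName"]}: {key["AccessKeyId"]} is {age} days old')
```

Passing `now` as a parameter keeps the function deterministic in tests, while the Lambda handler can call it with the default.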

Project 2: Daily Cloud Cost Report Lambda

EventBridge-triggered Lambda that pulls Cost Explorer data via boto3, formats a per-service cost breakdown, compares against last week, and emails the report via SES every morning.


Class 4: ClamAV File Scanning Automation for S3 Security

  • subprocess for running system tools from Python — the right way
  • ClamAV setup, freshclam for virus DB updates, scan result parsing via return codes
  • S3 event notification → SQS → Python consumer pattern end to end
  • Downloading files from landing bucket, scanning locally, routing based on result
  • S3 object tagging — Clean/Infected with put_object_tagging
  • Multi-account AWS architecture — landing account vs clean account
  • SES email alerts for infected files with full metadata
  • Production error handling — ClamAV crash, S3 download failure, malformed SQS message

Project: Banking Compliance File Scanner

S3 upload triggers SQS message → Python consumer downloads file → ClamAV scans → tags object Clean or Infected → routes clean files to processing bucket → blocks infected files → SES alert to security team with filename, bucket, timestamp, and full scan output. Full logging, retry logic, and dead letter queue for failed scans.
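The scan step hinges on reading clamscan's exit code: 0 means no virus found, 1 means a virus was found, and 2 means the scan itself hit an error. A minimal sketch of that mapping follows. The function names are hypothetical, and `scan_file` assumes ClamAV is installed with `clamscan` on the PATH.

```python
import subprocess


def classify_scan(returncode):
    """Map a clamscan exit code to the tag value used in the pipeline."""
    if returncode == 0:
        return "Clean"
    if returncode == 1:
        return "Infected"
    # Any other code means the scan did not complete; never treat it as clean.
    raise RuntimeError(f"clamscan failed with exit code {returncode}")


def scan_file(path, clamscan="clamscan"):
    """Scan one local file and return (tag, scanner output). Requires ClamAV."""
    result = subprocess.run(
        [clamscan, "--no-summary", path],
        capture_output=True,
        text=True,
        check=False,  # exit code 1 is a valid result, not a crash
    )
    return classify_scan(result.returncode), result.stdout.strip()


if __name__ == "__main__":
    for code in (0, 1):
        print(code, classify_scan(code))
```

Raising on unexpected codes lets the consumer leave the SQS message unacknowledged, so a failed scan can be retried or sent to the dead letter queue instead of being tagged.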

Class 5: FinOps Automation + Cost Optimization Scripts

  • API Gateway + Lambda integration — proxy vs non-proxy, request/response mapping
  • Multi-Lambda orchestration patterns — chaining, fan-out, fan-in
  • Lambda cost optimization: right-sizing memory, reducing cold starts
  • RDS cost breakdown — instance type, storage, IOPS, Multi-AZ, snapshots
  • boto3 for pulling RDS and EC2 metrics via CloudWatch — CPU, connections, storage utilization
  • Identifying rightsizing candidates — underutilized instances, oversized storage
  • EC2 cost analysis — finding idle instances, stopped instances still costing money
  • Building a FinOps report that runs weekly and emails recommendations

Projects:

  • Multilevel Image Processing Pipeline: client uploads image → S3 trigger → Lambda 1 validates format and size → Lambda 2 transforms (resize, watermark, convert format) → stores to clean bucket → Lambda 3 sends SES notification with processed image link. Full error handling, DLQ for failed images, CloudWatch dashboard showing pipeline health.
  • RDS usage report generation for FinOps analysis of cost-saving opportunities


Class 6: RDS Migration Automation

  • Migration planning — pre-migration health check script, validating source DB before starting
  • pg_dump and pg_restore from Python subprocess — full and schema-only dumps
  • pgsync for live data migration with minimal downtime
  • Data validation post-migration — row counts, checksum comparison, schema diff
  • Migration rollback strategy — when to cut back, how to keep the source alive
  • Post-migration validation script — automated integrity checks with SES report
  • Containerizing the migration script with Docker — Dockerfile, entrypoint, env vars
  • Deploying a migration job on ECS Fargate as a one-off task, not a long-running service
  • Lambda trigger for the ECS task — one-click or scheduled migration kickoff

Project: End-to-End RDS Migration Platform

Python migration script containerized with Docker, deployed to ECS Fargate as a one-off task. Lambda triggers the migration on demand, pgsync runs inside the container with source and target RDS connections, post-migration validation Lambda runs row count and checksum checks, SES sends status updates at each stage. Production-ready and reusable for any PostgreSQL migration on AWS.
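The row count check in the validation stage boils down to comparing two table-to-count mappings and reporting every difference, including tables missing on one side. A minimal sketch follows. It assumes the counts have already been collected from each database, and the function name and sample numbers are illustrative only.

```python
def compare_row_counts(source, target):
    """Return (table, source_count, target_count) for every table that differs.

    A count of None means the table is missing on that side.
    """
    mismatches = []
    for table in sorted(set(source) | set(target)):
        src, tgt = source.get(table), target.get(table)
        if src != tgt:
            mismatches.append((table, src, tgt))
    return mismatches


# Invented counts for illustration.
SOURCE_COUNTS = {"orders": 1200, "users": 300, "audit_log": 50}
TARGET_COUNTS = {"orders": 1200, "users": 299}

if __name__ == "__main__":
    problems = compare_row_counts(SOURCE_COUNTS, TARGET_COUNTS)
    if problems:
        for table, src, tgt in problems:
            print(f"MISMATCH {table}: source={src} target={tgt}")
    else:
        print("Row counts match")
```

Matching row counts are necessary but not sufficient, which is why the project pairs them with checksum comparison.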


Module 4: (6-week) Kubernetes on AWS

Kubernetes Architecture + Core Concepts

  • The why behind Kubernetes — what broke before it existed
  • Control plane deep dive — API server, etcd, scheduler, controller manager
  • Worker node components — kubelet, kube-proxy, container runtime
  • Core objects — Pod, ReplicaSet, Deployment, Service
  • Setting up Minikube locally, kubectl basics and everyday commands
  • YAML manifests in depth — apiVersion, kind, metadata, spec
  • ConfigMaps and Secrets — creating, mounting as env vars and volumes
  • Namespaces and resource organisation
  • Labels, selectors, annotations — how Kubernetes finds things
  • Resource requests and limits — why they matter in production
  • Kubernetes DNS and service discovery internals
  • ImagePullSecrets for private registries
  • Lens (Freelens) — Kubernetes IDE for visual cluster management

Project: Deploy a 2-tier e-commerce app (frontend + PostgreSQL) on Minikube — wired together with Services, ConfigMaps, Secrets, private image registry


Resilience Patterns, Autoscaling + Live Debugging

  • Liveness, readiness, and startup probes — real failure scenarios
  • Rolling upgrades and rollback strategies
  • HPA and VPA — pod autoscaling based on CPU/memory/custom metrics
  • Init containers and sidecar patterns
  • Pod Disruption Budgets for zero-downtime deployments
  • Deployment strategies — Recreate vs RollingUpdate vs Blue-Green
  • CrashLoopBackOff, OOMKilled — live debugging techniques
  • Resource quotas and LimitRanges per namespace
  • Reading Kubernetes events to diagnose failures fast
  • StatefulSets intro — ordered deployment, stable network identity
  • DaemonSets and Jobs — when to use each

Project: Add HPA, probes, and PodDisruptionBudget to the e-commerce app. Simulate CrashLoopBackOff and OOMKilled failures live and debug them. Add a PostgreSQL StatefulSet with persistent local storage


GitOps with ArgoCD + CI/CD Pipeline on Minikube

  • GitOps fundamentals — why GitOps over push-based deployments
  • ArgoCD setup on Minikube — apps, sync policies, health checks
  • End-to-end CI/CD pipeline — GitHub Actions builds image, ArgoCD deploys
  • ArgoCD app-of-apps pattern intro
  • Branching strategy for GitOps — app repo vs config repo separation
  • Rollback with ArgoCD — one-click vs automated
  • Basic Prometheus + Grafana on Minikube — request rate, pod health dashboards
  • Debugging failed ArgoCD syncs — common causes and fixes
  • Multi-environment GitOps intro — dev vs prod namespaces on same cluster

Project: GitHub Actions pipeline builds and pushes e-commerce image on every commit, ArgoCD auto-deploys to Minikube, basic Grafana dashboard showing pod health and request rate, rollback demonstrated live


Production EKS Setup + Networking + Security Foundations

  • EKS cluster setup via eksctl and AWS console
  • EKS add-ons — VPC CNI, CoreDNS, EBS CSI Driver, kube-proxy
  • IRSA — Kubernetes to AWS IAM with OIDC, no hardcoded credentials
  • AWS Load Balancer Controller with Helm — architecture and annotations
  • Ingress for internal and external traffic routing
  • ExternalDNS for automatic Route53 record management
  • Domain, SSL/TLS termination with ACM
  • EKS managed node groups vs self-managed nodes — when to use which
  • EKS access entry for cluster access

Project: EKS cluster up with eksctl, AWS Load Balancer Controller and ExternalDNS deployed via Helm, RBAC hardened, custom domain with SSL termination working


Running 3-Tier App on EKS + AWS Integrations

  • Running 3-tier app — frontend + backend + RDS PostgreSQL on EKS
  • Database migrations using Kubernetes Jobs
  • Init containers for DB connection readiness checks
  • IRSA in practice — backend pod accessing Secrets Manager without credentials
  • AWS Secrets Manager integration — External Secrets Operator pattern
  • Ingress rules for routing traffic to frontend vs backend
  • Health checks at load balancer level vs pod level
  • Blue-Green deployment on EKS with weighted routing
  • Namespace strategy for multi-tier apps
  • Real troubleshooting — ImagePullBackOff, pending pods, service not reachable

Project: Full 3-tier app on EKS — frontend + Node.js backend + RDS PostgreSQL, IRSA for Secrets Manager, DB migration Job, custom domain, SSL, live troubleshooting of staged failures


StatefulSets, Persistent Storage + Docker Image Optimisation

  • StatefulSets deep dive — production patterns and failure recovery
  • PersistentVolume, PVC, StorageClass — static vs dynamic provisioning on EKS
  • EBS vs EFS — choosing the right storage for the workload
  • Headless Services for StatefulSet DNS resolution
  • Troubleshooting multi-attach volume errors and common StatefulSet failures
  • Volume snapshots and backup strategies on EKS
  • Multi-stage Docker builds — drastically smaller production images
  • Distroless and minimal base images for attack surface reduction
  • Docker image optimisation — layer caching, build context, .dockerignore

Project: Add MinIO as a StatefulSet with persistent EBS storage to the e-commerce app for product image uploads. Rebuild all images with multi-stage builds, integrate Trivy in GitHub Actions, and reduce image sizes by 60%+


Production EKS with Terraform

  • Production EKS cluster with Terraform — VPC, subnets, node groups, and add-ons
  • Terraform module structure for EKS — separation of concerns
  • Managing dev/staging/prod with Terraform workspaces
  • Deploying AWS Load Balancer Controller and ExternalDNS via Terraform
  • IRSA setup via Terraform — no manual console steps
  • Terraform drift detection on EKS infrastructure
  • Node group configuration — instance types, spot vs on-demand, taints and tolerations
  • EKS upgrade strategy with Terraform — node group rotation

Project: Rebuild the entire EKS cluster from scratch with Terraform — VPC, node groups, add-ons, IRSA, Load Balancer Controller, and ExternalDNS all provisioned via code. Zero manual console steps


Microservices on EKS + Advanced ArgoCD GitOps

  • Microservices design principles — bounded context, single responsibility
  • Splitting monolith into microservices — frontend, order, inventory, and user service
  • Inter-service communication — ClusterIP vs headless vs service mesh
  • Network Policies for microservice traffic isolation between namespaces
  • Gateway API — advanced ingress routing vs traditional Ingress
  • ArgoCD App-of-Apps pattern — managing many services cleanly
  • ArgoCD ApplicationSet for environment promotion across dev/staging/prod
  • Matrix builds in GitHub Actions for multiple microservices

Project: E-commerce app split into 4 microservices each with own Helm chart and ArgoCD Application, App-of-Apps managing all services, Gateway API routing, matrix CI/CD builds, OpenCost showing per-service spend


Metrics, Logs + Dashboards

  • How observability works in real production companies
  • Prometheus — metrics collection, PromQL, scrape configs
  • Prometheus Operator and ServiceMonitor CRDs
  • Loki for log storage and querying — LogQL basics
  • Fluent Bit on EKS — log aggregation, filtering, routing to Loki
  • Grafana dashboards — Kubernetes cluster, app metrics, AWS resource metrics
  • AlertManager — routing alerts to Slack and PagerDuty with grouping and silencing
  • CloudWatch Container Insights integration alongside Prometheus
  • Monitoring differences — Fargate vs managed node groups
  • Cost visibility dashboard — RDS, Lambda, EKS node costs in Grafana

Project: Prometheus + Loki + Grafana + Fluent Bit deployed on EKS, Grafana dashboard showing order volume, error rates, DB query latency, AlertManager fires Slack alert when order service error rate crosses 1%, CloudWatch Container Insights alongside


Distributed Tracing, SLOs + Advanced Alerting

  • OpenTelemetry for distributed tracing — instrumentation, collectors, exporters
  • Tracing a request across frontend → order service → inventory service → DB
  • Jaeger or Tempo as tracing backend — setup and querying
  • SLO and SLI definitions — what they mean in practice
  • Error budget dashboards in Grafana — how teams use them day to day
  • Multi-window, multi-burn-rate alerting for SLOs
  • AlertManager advanced — inhibition rules, routing trees, deduplication
  • Runbook links in alerts — connecting alert to action
  • Log-based alerting in Grafana with Loki rules
  • Observability for stateful services — what’s different about monitoring databases
  • Live debugging 10 real Kubernetes interview scenarios — staged failures on the e-commerce cluster
  • Advanced live troubleshooting — node pressure, evictions, DNS failures, RBAC misconfiguration

Project: OpenTelemetry tracing across all e-commerce microservices, Tempo as backend, Grafana trace explorer showing end-to-end request flow, SLO dashboards with error budget burn rate, multi-window AlertManager rules for order service


Service Mesh + Network Policies + Zero Trust

  • Service mesh fundamentals — why it exists, what problems it actually solves
  • Istio installation and architecture — control plane, data plane, sidecars
  • mTLS between all microservices — automatic, no code changes
  • Traffic management — VirtualService, DestinationRule, Gateway
  • Canary deployments with Istio traffic splitting — 10% to new version
  • Visualising service mesh traffic with Kiali
  • Network Policies for zero-trust pod-to-pod communication
  • Egress controls and namespace isolation
  • Pod topology spread constraints for multi-AZ resilience
  • Istio observability — built-in metrics, tracing integration with Jaeger

Project: Istio deployed on e-commerce EKS cluster, mTLS enforced between all microservices, canary deployment for order service routing 10% traffic to v2, Kiali showing live traffic topology, network policies blocking all non-essential pod communication


Karpenter, EKS Auto Mode + Cost Optimisation

  • Karpenter architecture — how it differs from Cluster Autoscaler
  • NodePool and EC2NodeClass configuration
  • Cost optimisation with Spot + On-Demand mixed node fleets
  • Karpenter bin packing and consolidation policies — removing underutilised nodes
  • Taints, tolerations, and node selectors with Karpenter
  • EKS Auto Mode — what it is, when to use it over Karpenter
  • Pod topology spread constraints across AZs with Karpenter
  • Kyverno policy enforcement — blocking deployments without resource limits
  • Pod security admission — restricted, baseline, privileged modes
  • Security contexts and pod security standards in practice

Project: Karpenter deployed on EKS, replacing managed node group for inventory service, Spot instances with on-demand fallback, Kyverno policies enforced, blocking any deployment without resource limits and liveness probes, pod security standards applied cluster-wide


Module 5: (3-week) DevSecOps and SRE


  • DevSecOps on Kubernetes — the shift left mindset, where security fits in the SDLC, and why bolt-on security after deployment is a losing strategy
  • Security integrated into the pipeline — SAST with Semgrep for code-level vulnerabilities, DAST with OWASP ZAP against staging, SCA and dependency auditing with Trivy and Grype
  • Pre-commit secret scanning and GitHub secret scanning — catching credentials before they hit the repository, not after
  • Container supply chain security — Trivy blocking critical CVEs before image push, image signing with Cosign, SBOM generation with Syft so you know exactly what is in every image
  • IaC security scanning — Checkov linting Terraform and Kubernetes manifests before apply, failing the pipeline on misconfigurations not just vulnerabilities
  • Secrets management in the pipeline — no hardcoded credentials anywhere, OIDC for AWS auth, External Secrets Operator pulling from Secrets Manager into pods at runtime

  • Falco for runtime threat detection on EKS — custom rules for suspicious process execution, unexpected file access, and anomalous network activity, alerting to Slack in real time
  • Kyverno policy engine — blocking unsigned images, requiring resource limits, enforcing labels and security contexts, validating ingress rules cluster-wide
  • Pod Security Admission — restricted, baseline, and privileged modes, applying the right profile per namespace and understanding what each one actually prevents
  • CIS Kubernetes Benchmark scanning with kube-bench — running the benchmark against the EKS cluster, understanding each failed check, systematic remediation
  • Network policies for zero-trust between namespaces — default deny all, explicit allow only, verifying isolation holds under live traffic

Project: Complete DevSecOps Pipeline

GitHub Actions enforces the full chain before anything reaches EKS: Semgrep scans application code, Checkov lints manifests, Trivy blocks images with critical CVEs, Cosign signs every image that passes. On the cluster side: Kyverno rejects any unsigned image or manifest missing resource limits, Falco monitors runtime behaviour and fires Slack alerts on suspicious activity, network policies enforce zero-trust between namespaces, kube-bench runs on a schedule and posts a remediation report. Every stage has a hard fail — nothing progresses if the gate does not pass.

While I cover SRE topics in the Kubernetes, MLOps, and AIOps sections, this part focuses on core SRE practice

  • SLI, SLO, SLA — precise definitions, writing measurable objectives that mean something, and how error budgets make deployment decisions instead of gut feel
  • DORA metrics — deployment frequency, lead time for changes, MTTR, and change failure rate as a team health signal, and how to actually measure them across the bootcamp stack
  • On-call culture and runbook discipline — what a good runbook looks like, escalation path design, handoff practice, and why most runbooks fail at 2 am
  • Postmortem culture — blameless analysis, timeline reconstruction, contributing factors, and writing RCAs that prevent recurrence rather than assign blame
  • Error budget policy — what happens when the budget is burned, how to freeze deployments, negotiate reliability work vs feature work
  • Kubernetes interview scenarios — the 10 most common live debugging questions asked at the senior level, and how to think through them systematically under pressure
  • Chaos engineering with LitmusChaos — pod failure, network delay, CPU stress, and disk fill experiments with defined steady-state hypotheses and explicit blast radius limits
  • Three live war room simulations drawn from across the bootcamp stack, followed by a written RCA
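
To make the error budget idea concrete, here is a small worked example in Python. It is a sketch under simple assumptions: a time-based availability SLO, a 30-day window, and a policy that freezes deployments once the budget is fully spent. The function names are illustrative only.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for a time-based SLO, e.g. slo=0.999."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining_fraction(slo, bad_minutes, window_days=30):
    """Fraction of the budget left; negative means the SLO is already missed."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget


def deployments_frozen(slo, bad_minutes, window_days=30):
    """Example policy: freeze feature deployments once the budget is used up."""
    return budget_remaining_fraction(slo, bad_minutes, window_days) <= 0.0


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 10.8 bad minutes, about three quarters of the budget is left.
    print(round(budget_remaining_fraction(0.999, 10.8), 2))
    print(deployments_frozen(0.999, 50))
```

The point of the calculation is that "can we ship today?" becomes a number the team agreed on in advance rather than a debate.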

Module 6: (4-week) MLOps


  • How a model goes from a data scientist’s laptop to production
  • Where platform engineers own the problem vs data scientists
  • MLflow for experiment tracking and model registry — hands-on
  • DVC for data versioning — treating datasets like code
  • Project: Track 3 training runs in MLflow, version the dataset with DVC, promote a model from Staging to Production in the registry

  • Kubeflow Pipelines on EKS — DAG workflows, parameterized runs, cached steps
  • Argo Workflows as an alternative — when to prefer it over Kubeflow
  • GPU node setup on EKS — NVIDIA device plugin, taints, tolerations
  • Spot instances for training via Karpenter with on-demand fallback
  • Project: Build a full training pipeline on Kubeflow — data validation → training → evaluation → model registry promotion. Runs on Spot, falls back to on-demand automatically
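
The pipeline shape in this project (data validation → training → evaluation → model registry promotion) is a DAG with cached steps. The following is a plain-Python sketch of that idea using the standard library, not the Kubeflow Pipelines SDK. The step functions are placeholders, and the cache key (step name plus parameters) is a simplification of what a real orchestrator does.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
PIPELINE = {
    "data_validation": set(),
    "training": {"data_validation"},
    "evaluation": {"training"},
    "registry_promotion": {"evaluation"},
}


def run_pipeline(pipeline, steps, cache, params):
    """Run steps in dependency order, skipping any step already cached
    for the same parameters. Returns (order, steps actually executed)."""
    order = list(TopologicalSorter(pipeline).static_order())
    executed = []
    for name in order:
        key = (name, tuple(sorted(params.items())))
        if key not in cache:
            cache[key] = steps[name](params)
            executed.append(name)
    return order, executed


# Placeholder step implementations; a real step would launch a container.
STEPS = {name: (lambda params, n=name: f"{n} ok") for name in PIPELINE}

if __name__ == "__main__":
    cache = {}
    print(run_pipeline(PIPELINE, STEPS, cache, {"epochs": 5}))
    # Same parameters again: every step is served from the cache.
    print(run_pipeline(PIPELINE, STEPS, cache, {"epochs": 5}))
    # Changed parameters: every step runs again.
    print(run_pipeline(PIPELINE, STEPS, cache, {"epochs": 10}))
```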

  • Serving patterns — batch, real-time REST, async queue, streaming — when each makes sense
  • BentoML for packaging and deploying models on EKS
  • SageMaker real-time endpoints vs self-hosted EKS — honest trade-off discussion
  • KEDA autoscaling on SQS queue depth for async inference
  • Project: Deploy the same model two ways — BentoML on EKS and SageMaker endpoint. Compare latency, cost, and operational overhead side by side
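
Autoscaling on queue depth comes down to simple arithmetic: the number of replicas is the backlog divided by the number of messages one replica should handle, clamped to a minimum and maximum. The sketch below shows that calculation in Python. It is a simplified model of what KEDA and the Horizontal Pod Autoscaler do together (it ignores tolerance, cooldown, and stabilisation windows), and the parameter names are illustrative.

```python
import math


def desired_replicas(queue_depth, target_per_replica, min_replicas=0, max_replicas=10):
    """Replicas needed so each one handles about target_per_replica messages."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    print(desired_replicas(0, 5))      # empty queue scales to the minimum
    print(desired_replicas(23, 5))     # 23 messages at 5 per replica
    print(desired_replicas(1000, 5))   # capped at max_replicas
```

With a minimum of zero, inference pods cost nothing while the queue is empty, which is the main reason to prefer this pattern for async workloads.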

  • Champion vs challenger pattern — automated comparison, promote only if challenger wins on accuracy and latency
  • ML governance — audit trail, who promoted what model, trained on what data
  • Canary model rollout with Istio — 10% to new model, promote on metric threshold
  • Evidently AI as a sidecar for data quality and drift scoring
  • Prometheus custom metrics from inference — confidence scores, prediction volume, latency histograms
  • Cost visibility with OpenCost — per-model attribution, right-sizing serving replicas
  • Project: Full ML CI/CD pipeline — code push triggers training, challenger evaluated against champion, canary rollout via Istio, drift monitoring live with automated retraining
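
The champion vs challenger rule ("promote only if challenger wins on accuracy and latency") can be expressed as a small gate function. This is a sketch with assumed metric names and made-up numbers. In practice the metrics would come from the evaluation step of the pipeline, and the margin would be agreed with the data science team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelMetrics:
    name: str
    accuracy: float        # higher is better
    p95_latency_ms: float  # lower is better


def should_promote(champion, challenger, min_accuracy_gain=0.0):
    """Promote only when the challenger wins on both accuracy and latency."""
    wins_accuracy = challenger.accuracy > champion.accuracy + min_accuracy_gain
    wins_latency = challenger.p95_latency_ms < champion.p95_latency_ms
    return wins_accuracy and wins_latency


if __name__ == "__main__":
    # Illustrative numbers only.
    champion = ModelMetrics("v1", accuracy=0.91, p95_latency_ms=120.0)
    faster_and_better = ModelMetrics("v2", accuracy=0.93, p95_latency_ms=95.0)
    better_but_slower = ModelMetrics("v3", accuracy=0.95, p95_latency_ms=180.0)
    print(should_promote(champion, faster_and_better))
    print(should_promote(champion, better_but_slower))
```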

Module 7: (3 weeks) AIOps


  • AIOps from a platform engineer’s perspective — what it is, what it is not, and where LLMs genuinely help operations vs where they add noise
  • LLM deployment options on AWS — Bedrock for managed models (Claude, Llama, Titan), SageMaker JumpStart, and self-hosted open-source models on EKS with vLLM
  • Prompt engineering for operations — structured outputs, chain-of-thought for incident reasoning, avoiding hallucinated commands in production contexts
  • LLM security in operations — prompt injection risks, guardrails, input validation, why an LLM with AWS credentials needs hard constraints
  • LLM-driven cost optimisation — analysing Cost Explorer output, generating rightsizing recommendations, flagging idle resources with supporting evidence
  • Running a local LLM for development — testing prompts offline before hitting Bedrock, cost discipline from day one
  • Building custom MCP servers — creating cost reports, provisioning, and deleting AWS resources from natural language prompts
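
Structured outputs and "avoiding hallucinated commands" go together: if the model must answer in a fixed JSON shape, the platform code can validate that shape and filter the suggested commands before anyone sees them. The sketch below assumes a response schema with three keys and a small read-only allow-list. Both are assumptions for illustration, and a prefix check like this is a first line of defence, not a complete sandbox.

```python
import json

REQUIRED_KEYS = {"severity", "probable_cause", "suggested_commands"}
# Only read-only commands are allowed through (illustrative list).
READ_ONLY_PREFIXES = ("kubectl get ", "kubectl describe ", "kubectl logs ")
SHELL_METACHARACTERS = set(";|&$`<>")


def validate_response(raw):
    """Parse the model output and split commands into accepted and rejected."""
    data = json.loads(raw)  # raises ValueError on non-JSON output
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    accepted, rejected = [], []
    for command in data["suggested_commands"]:
        is_read_only = command.startswith(READ_ONLY_PREFIXES)
        has_metachar = any(ch in SHELL_METACHARACTERS for ch in command)
        (accepted if is_read_only and not has_metachar else rejected).append(command)
    return {**data, "suggested_commands": accepted, "rejected_commands": rejected}


if __name__ == "__main__":
    raw = json.dumps({
        "severity": "high",
        "probable_cause": "container OOMKilled",
        "suggested_commands": [
            "kubectl get pods -n shop",
            "kubectl delete namespace shop",
            "kubectl logs api; rm -rf /",
        ],
    })
    print(validate_response(raw))
```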

  • Why a general-purpose LLM knows nothing about your infrastructure — the case for RAG over fine-tuning in operations contexts
  • Embedding models and vector databases — pgvector on RDS and OpenSearch vector engine, understanding chunking strategies for technical documents
  • Building a RAG system that indexes all runbooks, postmortems, and architecture decision records from the project — the operational knowledge base
  • FastAPI inference service on EKS — accepts incident descriptions, retrieves the top relevant documents, calls Bedrock, returns a structured response with probable cause, suggested commands, and runbook link
  • Continuous re-ingestion triggered by a Git webhook when runbooks are updated — keeping the knowledge base current automatically
  • Evaluating RAG quality — retrieval precision, answer relevance, and knowing when your system is confidently wrong
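
The retrieval step and the "retrieval precision" metric can be sketched without any vector database. The example below ranks documents by cosine similarity and computes precision over the retrieved set. The three-dimensional vectors and document names are toy values invented for the example. A real system would get embeddings from an embedding model and store them in pgvector or OpenSearch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, index, k=3):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def precision_at_k(retrieved, relevant):
    """Share of retrieved documents that a human marked as relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc_id in retrieved if doc_id in relevant) / len(retrieved)


# Toy index: hypothetical document ids with made-up embeddings.
INDEX = {
    "runbook-oom": [0.9, 0.1, 0.0],
    "runbook-dns": [0.0, 0.9, 0.2],
    "postmortem-oom": [0.8, 0.2, 0.1],
    "adr-networking": [0.1, 0.8, 0.5],
}

if __name__ == "__main__":
    query = [1.0, 0.1, 0.0]  # stands in for "pods restarting with OOMKilled"
    retrieved = top_k(query, INDEX, k=3)
    print(retrieved)
    print(precision_at_k(retrieved, {"runbook-oom", "postmortem-oom"}))
```

Tracking precision on a fixed set of labelled incident queries is one simple way to notice when a chunking or embedding change makes retrieval worse.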

  • Agentic AI architecture for operations — ReAct pattern, tool calling, and decision loops with constrained action spaces
  • LLM-powered incident triage — AlertManager webhook triggers a Lambda agent that queries Loki for recent logs, queries Prometheus for metrics, retrieves relevant runbook sections from the RAG system, and posts a structured Slack message with severity, probable cause, and first three recommended actions in under 60 seconds
  • Tool calling design for operations — what tools the agent can call, what it cannot, and why the action space must be explicitly bounded
  • Automated remediation with guardrails — blast radius limits, rollback triggers, maximum actions per incident, hard stops the agent cannot override
  • Audit logging for every automated action — who triggered it, what the LLM reasoned, what action was taken, what the outcome was
  • Dry-run mode — agent proposes actions, human approves, then executes — building trust before going fully autonomous
  • Self-healing infrastructure patterns — detect, decide, act, and verify — the full loop and where each stage can go wrong
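
The guardrails listed above (bounded action space, maximum actions per incident, dry-run with human approval, audit logging) can be combined in one small gatekeeper that sits between the agent and the cluster. This is a minimal sketch: the action names, limits, and log fields are assumptions, and the actual execution of an action is deliberately left out.

```python
from dataclasses import dataclass, field

# The only actions the agent may ever request (illustrative).
ALLOWED_ACTIONS = {"restart_deployment", "scale_deployment"}


@dataclass
class RemediationGuard:
    max_actions_per_incident: int = 3
    dry_run: bool = True
    audit_log: list = field(default_factory=list)
    _counts: dict = field(default_factory=dict)

    def request(self, incident_id, action, reasoning, approved_by=None):
        """Decide what happens to a requested action and record the decision."""
        used = self._counts.get(incident_id, 0)
        if action not in ALLOWED_ACTIONS:
            outcome = "rejected: action not in allow-list"
        elif used >= self.max_actions_per_incident:
            outcome = "rejected: action limit reached"
        elif self.dry_run and approved_by is None:
            outcome = "proposed: awaiting human approval"
        else:
            # A real implementation would call the executor here.
            self._counts[incident_id] = used + 1
            outcome = "executed"
        # Every request is logged, including the rejected ones.
        self.audit_log.append({
            "incident_id": incident_id,
            "action": action,
            "reasoning": reasoning,
            "approved_by": approved_by,
            "outcome": outcome,
        })
        return outcome


if __name__ == "__main__":
    guard = RemediationGuard(max_actions_per_incident=1)
    print(guard.request("inc-1", "delete_namespace", "free up resources"))
    print(guard.request("inc-1", "restart_deployment", "pods stuck in CrashLoopBackOff"))
    print(guard.request("inc-1", "restart_deployment", "pods stuck", approved_by="on-call"))
    print(guard.request("inc-1", "scale_deployment", "still degraded", approved_by="on-call"))
```

Because the limits live in code outside the model, the agent cannot talk its way past them, which is the difference between a guardrail and a prompt instruction.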


Bonus: Getting ready for the job market

How senior engineers talk about reliability in interviews — cost vs reliability vs complexity framing, presenting war room experience as production credibility

Recorded mock interviews — full mock interview sessions covering DevOps, Kubernetes, MLOps, and system design questions, recorded and shared with all students.

Group sessions — live group interview practice sessions where students interview each other with feedback from Akhilesh.

Resume framework — action verbs, metric-driven project impact statements, and how to present bootcamp work as production experience to a hiring manager.

LinkedIn and resume review by Akhilesh — personal feedback, not automated.

Referrals to opportunities in the LivingDevOps network where there is a genuine fit.

GitHub portfolio cleanup — README structure, architecture diagrams, ADRs, and decision logs that show engineering judgment, not just code.


What you will get out of this Advanced DevOps, MLOps & AIOps Bootcamp on AWS

  • 25 weeks of live instruction · 50 classes, 3 hours each, and 1 bonus week of interview prep
  • Pre-bootcamp Linux + AWS recorded sessions sent on enrolment
  • You will build 20+ real-world projects, debug live production failures, and walk away with a GitHub portfolio that shows engineering judgment — not just tutorial code.
  • Lifetime access to all recordings, code, notes, and resources
  • LivingDevOps Discord — cohort community + alumni network
  • Referrals to relevant opportunities in the network where there is a genuine fit
  • Certificate of completion — DevOps + MLOps + AIOps

This curriculum follows a logical, incremental learning path from Linux fundamentals to advanced Kubernetes projects, ensuring each concept builds upon previous knowledge

Reach out for queries and part-payment requests
  • Email: livingdevops@gmail.com
  • WhatsApp: +91 9259681620

70k INR

Testimonials

Balmiki Badatya

Livingdevops bootcamp covers everything Linux, Docker, EKS, Terraform, Python, and it's all structured so well that nothing feels rushed. The best part is how practical it is every concept comes with real examples that actually make sense instead of just slides and definitions.

Sandeep

I have recently completed the AWS DevOps Bootcamp, and it has been a great experience for my career. The program is well-structured, starting from foundational cloud concepts and progressing into advanced DevOps practices like CI/CD pipelines, infrastructure as code, and containerization. Throughout the bootcamp Akhilesh was very supportive and always ready to clarify doubts and provide industry insights. I would highly recommend this bootcamp to anyone looking to build or advance their career in cloud and DevOps.

Hemant kumar

​"I recently completed the DevOps bootcamp from Akhilesh, and it was a game-changer. His ability to break down complex concepts into actionable steps made the material incredibly easy to digest. I walked away not just with new knowledge, but with the confidence to apply it immediately to my role."

Ameet Khemani

Akhilesh, you are one of the best mentor in today's time. I really learned new things with clear cut understanding and that also in reference with real world examples. I would definitely recommend this course to anyone who want to do better in DevOps.

Avinash V

As a senior Devops professional with 12+ years with mostly in the legacy enterprise environments, Akhilesh’s Devops bootcamp was the ideal bridge to cloud-native mastery with production focused training and projects, live troubleshooting etc. I would highly recommend his bootcamps to anyone who are serious to learn and excel in the Devops field.

Varsha Gore

Akhilesh has provided structured DevOps course details right from the beginning. I could see the detail oriented approach and his sincerity throughout those sessions. He was able to show what to expect and how to troubleshoot. The additional resources were also very helpful.

Sajitha

Your structure of topics & teaching method are really great. This help us to understand the realworld infrastructure and daytoday activities in devops well. Thankyou AKhilesh for sharing knowledge & experience.

Gaurav Dubey

One of the best Devops Project Course. Thanks Akhilesh. I loved the real time troubleshooting part, i hav never seen someone do this
