25-Week AWS DevOps + MLOps + AIOps Bootcamp with Real-World Projects

India: Save 70k

Global: Save $750

Early Bird ends June 25th

Advanced DevOps Bootcamp with AWS, ECS, Python, Kubernetes, MLOps, AIOps, DevSecOps & SRE (Real-World Projects)

By the end of this bootcamp, you will be able to design, deploy, and operate production-grade infrastructure on AWS: containerised applications on EKS, ML model pipelines with drift detection, AI-powered incident triage, and more.
The bootcamp focuses on building real-world projects with production-level detail and live troubleshooting.
Every week is built around how senior DevOps and platform engineers actually work.
Whether you are transitioning into DevOps or are a senior engineer moving into cloud-native, MLOps, and AIOps, this bootcamp gives you the production experience that recruiters are asking for in 2026.

Pre-requisites: Basic Linux and AWS knowledge (I will provide you with resources/past recordings to go through before the class)

Module 1: (2-week) Getting comfortable with AWS

Week 1: Getting comfortable with AWS (VPC, S3, EC2, and other important stuff)

This week is about getting comfortable with my teaching style while learning the basics about AWS

Class 1

Creating a custom VPC and setting up full networking (private and public access)
Setting up a NAT Gateway for outbound-only access
Understanding how people use AWS in production
Using IAM roles, inline/custom policies
Bastion host pattern — why you never SSH directly to private instances
IAM Instance Profile for S3 access, SSH only via Bastion
VPC Endpoints — S3 Gateway endpoint, why it saves cost and improves security
EC2 login without an open SSH port or a public IP (with AWS Systems Manager Session Manager)
VPC Peering vs Transit Gateway — connecting VPCs

Project: Production VPC from Scratch. 6 subnets (2 public, 2 private app, 2 private DB) across 2 AZs, NAT Gateway, S3 Gateway endpoint, route tables for each tier, Security Groups per tier with least-privilege rules, NACLs on DB subnets, and a Bastion host (SSH to private instances goes through it, direct access is blocked).
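
The subnet layout in this project can be planned before touching the console. The sketch below is only an illustration: the VPC CIDR `10.0.0.0/16`, the `/24` subnet size, the AZ suffixes, and the function name `plan_subnets` are all assumptions, not values prescribed by the course.

```python
import ipaddress


def plan_subnets(vpc_cidr="10.0.0.0/16",
                 azs=("a", "b"),
                 tiers=("public", "private-app", "private-db"),
                 new_prefix=24):
    """Carve one subnet per (tier, AZ) pair out of the VPC CIDR, in order."""
    network = ipaddress.ip_network(vpc_cidr)
    blocks = network.subnets(new_prefix=new_prefix)  # generator of /24 blocks
    plan = {}
    for tier in tiers:
        for az in azs:
            plan[f"{tier}-{az}"] = str(next(blocks))
    return plan


if __name__ == "__main__":
    for name, cidr in plan_subnets().items():
        print(f"{name:15} {cidr}")
```

Keeping each tier in contiguous blocks makes route tables, NACL rules, and Security Group references easier to read later.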

Class 2: Storage

Discussing how different storage types are used in the real world
S3 storage classes, lifecycle policies, versioning, bucket policies
Static website hosting, pre-signed URLs, S3 event notifications
CloudFront — CDN, Origin Access Control, cache invalidation
S3 replication — SRR vs CRR, IAM roles, delete marker replication

Project 1: Static App on S3 + CloudFront + Custom Domain + SSL. S3 with OAC, CloudFront distribution, ACM cert in us-east-1, Route 53 alias record.

Project 2: Real-Time Cross-Region Replication. Source bucket in ap-south-1 → destination in us-east-1, versioning on both, lifecycle policy on destination.

Week 2: Running a 2-tier app on AWS with VPC, EC2

Class 1: Running a PostgreSQL-backed 2-tier app on EC2

Install PostgreSQL on EC2
Running a 2-tier app on EC2 with a production-grade setup
Elastic IP, Route 53 hosted zones, A records, alias records
Let’s Encrypt free SSL cert — DNS validation, attaching to resources
Elastic IP, Route 53 A record, HTTPS
nginx for reverse proxy, free SSL certs from Let’s Encrypt — HTTPS end-to-end
Using AI copilots for faster workflow and troubleshooting

Project: Running a Flask app on EC2 + nginx reverse proxy + Elastic IP + DNS + Free SSL

Class 2: Auto Scaling groups and load balancers to run an RDS-backed 2-tier app

ALB vs NLB: when to use which, listeners, rules, target groups, health checks
ALB path-based routing, HTTPS termination, HTTP → HTTPS redirect
Launch Templates, ASG: desired/min/max, AZ balancing
Scaling policies: target tracking, step scaling, scheduled
User data in ASG: pulling config from SSM on launch
Connection draining, instance refresh for zero-downtime AMI updates

Project: 2-Tier App with ALB + ASG + RDS + Route 53 + ACM. Launch Template with user data, ASG across 2 private subnets (min 2, max 6), ALB with HTTPS and health checks, target tracking on CPU, RDS PostgreSQL in private DB subnets, credentials in SSM Parameter Store. Simulate: terminate an instance, scale under load, verify zero downtime.

Project: Running a 2-tier app on EC2 with autoscaling, Load balancers

Module 2: (4-week) Running Containers in Production + Terraform + GitHub Actions CI/CD

Week 3: Running containers in production

Docker + running a two-tier App on ECS (Console)

Docker fundamentals and how it works in production
Building, sharing, and running custom Docker images
Container networking, volume management with meaningful use cases
Docker Compose for multi-container applications, app dependencies, and health checks
Migrating the app running on a VM to containers
Migrating a Dockerised app running on EC2 to ECS, planning, design, and execution
ECS fundamentals — clusters, services, task definitions, Fargate vs EC2 launch type
Task definition anatomy — container definitions, CPU/memory, environment variables, secrets
ECS service — desired count, deployment config, circuit breaker
ALB integration with ECS — target group, dynamic port mapping, health checks
ECS IAM — task role vs task execution role, what each one does
Secrets Manager + ECS — injecting secrets into containers without hardcoding
VM App Migration to AWS ECS: Migrating a containerised app running on EC2 to ECS Fargate — move secrets to Secrets Manager, upload files to S3, and cut over traffic from EC2 to ECS using Route 53 weighted routing with zero downtime.
ECS Fargate cluster, task definition for Flask app connecting to RDS PostgreSQL
ALB with HTTPS, health check on /health, CloudWatch log group per container
Secrets Manager for DB credentials injected at runtime, task role for S3 access

Project: Deploy an RDS Database-Backed 2-Tier App on ECS

Week 4: ECS with Terraform + Multi-Environment

Class 3: Full ECS Infrastructure with Terraform

Terraform architecture — providers, resources, state, plan, apply lifecycle
Writing your first resources — VPC, EC2, S3 in Terraform
Variables, locals, outputs — making configs reusable
Remote state — S3 backend, why local state breaks teams
terraform.tfvars and environment-specific variable files
terraform import — bringing existing resources under Terraform management
Common mistakes — hardcoded values, missing state locking, giant single files
Terraform module structure for ECS — VPC, ALB, ECS, RDS as separate modules
ECS service and task definition in Terraform — every config option explained
ALB in Terraform — listeners, rules, target groups, ACM cert attachment
RDS in Terraform — subnet groups, parameter groups, Multi-AZ toggle per environment
Terraform for_each and count — creating multiple similar resources cleanly
depends_on and resource ordering — avoiding race conditions on apply
Terraform module versioning — pinning modules for stability

Project: Dev and Production ECS Environments with Terraform

Single module codebase, two workspaces — dev (single AZ, smaller instances) and prod (Multi-AZ, larger)
Full stack — VPC, ALB, ECS Fargate, RDS, ACM, CloudWatch — all in Terraform
terraform plan output reviewed and applied cleanly for both environments

Class 4: Git, GitHub + CI/CD Fundamentals

Git fundamentals — commits, branches, merge vs rebase, resolving conflicts
Branching strategies — Gitflow vs trunk-based, what real teams actually use
GitHub Actions architecture — workflows, jobs, steps, runners
Triggers — push, pull_request, workflow_dispatch, schedule
Matrix build testing across Node 18 and Node 20
GitHub Actions authentication with AWS
Secrets and environment variables in GitHub Actions — repo secrets vs environment secrets
GitHub Environments — approval gates before deploying to production
GitHub Actions matrix builds — testing across multiple versions in parallel
SDLC and Jira — how tickets flow from backlog to deployed feature in real teams

Project: Multi-Stage CI Pipeline with Automated Testing

GitHub Actions pipeline — lint → unit test → build Docker image → push to ECR
Pull request check — pipeline must pass before merge is allowed

Week 5: Automated Deployments + Scaling + Monitoring

Class 5: Automated ECS Deployments with GitHub Actions

Container image versioning — git SHA tagging, semantic versioning, latest anti-pattern
ECR lifecycle policies — cleaning up old images automatically
ECS deployment strategies — rolling update, blue-green via CodeDeploy
GitHub Actions ECS deploy — aws-actions/amazon-ecs-deploy-task-definition
Environment-specific workflows — dev deploys on merge to main, prod requires approval
Rollback strategy — redeploying the previous task definition on failure
Deployment notifications — Slack alerts on success and failure

Project: Automated build and Deployment Pipeline for ECS

GitHub Actions builds an image on every merge, tags with the git SHA, and pushes to ECR
Automatic rollback if health checks fail within 5 minutes
Email notification on deployment success, failure, and rollback
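
The "roll back if health checks fail within 5 minutes" rule can be sketched as a small polling loop. This is a minimal illustration under assumptions: the function name `watch_deployment`, the 15-second interval, and the three-consecutive-failures threshold are hypothetical choices, and the actual rollback (redeploying the previous task definition) is left to the caller.

```python
import time


def watch_deployment(check_health,
                     window_seconds=300,
                     interval_seconds=15,
                     failure_threshold=3,
                     clock=time.monotonic,
                     sleep=time.sleep):
    """Poll a health check for `window_seconds`.

    Returns "rollback" as soon as `failure_threshold` consecutive checks
    fail, otherwise "healthy" once the window has passed.
    """
    deadline = clock() + window_seconds
    consecutive_failures = 0
    while clock() < deadline:
        if check_health():
            consecutive_failures = 0  # any success resets the streak
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return "rollback"
        sleep(interval_seconds)
    return "healthy"
```

Passing `clock` and `sleep` as parameters keeps the loop testable without waiting five real minutes; in a pipeline, `check_health` would typically call the service's `/health` endpoint.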

Class 6: ECS Auto Scaling + Load Testing + Monitoring

ECS service auto scaling — Application Auto Scaling, target tracking on CPU and ALB request count
ECS task-level scaling vs service-level scaling — understanding the difference
CloudWatch custom metrics — pushing app-level metrics from containers
CloudWatch dashboards — ECS service health, ALB latency, RDS connections in one view
CloudWatch alarms — composite alarms, alarm actions (SNS → email/Slack)
AWS X-Ray for distributed tracing on ECS — enabling, reading traces
Load testing with k6 or hey — simulating real traffic, finding bottlenecks
Reading CloudWatch Container Insights under load — what to look for

Project: Load Test + Auto Scaling + Monitoring Dashboard

Target tracking policy — scale ECS tasks when ALB request count per target exceeds threshold
Run k6 load test, watch tasks scale out, and verify ALB distributes traffic
CloudWatch dashboard showing ECS CPU, ALB 5xx rate, RDS connection count, p99 latency
Composite alarm — fires a Slack alert when both high CPU and high error rate occur together

Week 6: Advanced Terraform + Security + OIDC

Class 7: Three-Tier App with Advanced Terraform

Advanced Terraform modules — public registry vs private, module composition patterns
Data sources — referencing existing resources without hardcoding ARNs
Multi-environment strategy — dev, prod with shared modules and separate state files
Importing existing EKS resources into Terraform state
CloudFront in Terraform — distribution, origins, cache behaviours, OAC for S3
RDS in Terraform — automated backups, snapshot retention, Multi-AZ for prod
RDS Disaster recovery, cross-region Read Replica, automated snapshot copy
Terraform lifecycle blocks — prevent_destroy, create_before_destroy, ignore_changes
Terraform drift detection — terraform plan in CI to catch manual changes

Project: Three-Tier App with DR Strategy in Terraform

Full three-tier stack — CloudFront → ALB → ECS → RDS across dev, staging, prod
RDS Multi-AZ — synchronous replication, automatic failover
Read Replicas — async replication, read scaling, cross-region DR
Multi-AZ vs Read Replica — the difference most engineers get wrong
RDS snapshots, point-in-time recovery, Aurora basics, RDS Proxy
DR patterns — Backup & Restore, Pilot Light, Warm Standby, Multi-Site Active-Active
RDS Proxy for connection pooling

Class 8: OIDC for GitHub Actions + reusable GitHub Actions workflows

Why access keys in CI/CD are dangerous — rotation burden, leak risk, audit gaps
OIDC fundamentals — how GitHub proves its identity to AWS without a password
Setting up OIDC provider in AWS IAM — thumbprint, audience, provider URL
Keyless Terraform in CI — aws-actions/configure-aws-credentials with OIDC
Fine-grained OIDC conditions — locking roles to specific repos, branches, environments
Full end-to-end deploy — code push → OIDC auth → Terraform plan → ECS deploy — zero static credentials anywhere
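
To make the "locking roles to specific repos and branches" idea concrete, here is a sketch of the IAM trust policy that such a role would carry, built as a Python dict. The helper name `github_oidc_trust_policy` and the example account ID, organisation, and repository in the usage note are placeholders, not values from the course.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id, org, repo, branch="main"):
    """Trust policy that lets one repo/branch assume the role via OIDC."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    # audience used by aws-actions/configure-aws-credentials
                    f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                    # subject claim pins the role to a single repo and branch
                    f"{GITHUB_OIDC_HOST}:sub": f"repo:{org}/{repo}:ref:refs/heads/{branch}",
                }
            },
        }],
    }


def to_json(policy):
    """Render the policy for pasting into IAM or a Terraform variable."""
    return json.dumps(policy, indent=2)
```

For example, `to_json(github_oidc_trust_policy("123456789012", "example-org", "example-repo"))` yields a document that only workflows running on `main` in that repository can satisfy; a `StringLike` condition with a wildcard would be the looser alternative.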

Project:

Writing reusable workflows — workflow_call, composite actions
Setting up OIDC to avoid long-lived credentials in CI/CD (GitHub Actions) workflows

Module 3: (3-week) Python for DevOps

Week 7: Python Foundations + AWS Automation with Boto3

Class 1: Python for DevOps + Boto3 Deep Dive

Python environment setup, virtual environments, and project structure
Data structures DevOps engineers actually use — dicts, lists, sets for parsing API responses
os and subprocess modules — running shell commands, reading system state from Python
File operations, JSON/YAML parsing
Error handling and exception management
Creating reusable Python modules and a custom CLI
Working with JSON and YAML — parsing, validating, transforming config and API responses
Python logging best practices — levels, formatters, rotating file handlers

Project: AWS Resource Audit CLI. Build a production-grade Python CLI that connects to AWS via boto3, paginates through EC2, S3, and RDS resources across regions, generates a formatted report of all running resources with costs, and logs every operation. Runs on a schedule or on demand.

Class 2: Working with APIs + CRUD operations

Using the requests module to make API CRUD requests (Create, Read, Update, Delete)
Implementing proper error handling for API calls
Handling API authentication and headers
Parsing and validating API responses
boto3 architecture — sessions, clients vs resources, regions, profiles
Paginating through AWS APIs with get_paginator — why it matters at scale
Environment-based config management — .env, os.environ, secrets handling
Error handling and exception management for AWS API calls
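
The pagination topic above is easier to reason about once the underlying loop is visible. The sketch below shows roughly what a paginator does for you, using a fake in-memory API; `paginate`, `fake_list_items`, and the `"Items"` key are hypothetical names, and real AWS services use different token field names depending on the API.

```python
def paginate(fetch_page, result_key):
    """Yield every item from a token-paginated API.

    `fetch_page(token)` must return a dict holding the items under
    `result_key` and, when more pages exist, a "NextToken" entry.
    """
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get(result_key, [])
        token = page.get("NextToken")
        if not token:  # no token means this was the last page
            break


# Hypothetical stand-in for a paginated AWS call: three pages of results.
_FAKE_PAGES = {
    None: {"Items": ["i-001", "i-002"], "NextToken": "page-2"},
    "page-2": {"Items": ["i-003"], "NextToken": "page-3"},
    "page-3": {"Items": ["i-004"]},
}


def fake_list_items(token=None):
    return _FAKE_PAGES[token]


if __name__ == "__main__":
    print(list(paginate(fake_list_items, "Items")))
```

Forgetting this loop is the classic bug at scale: a script that reads only the first page looks correct in a small account and silently misses resources in a large one.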

Project: API CRUD Automation Script. A Python script that uses the requests module to perform full CRUD operations against a REST API, with proper auth headers, error handling, response validation, and structured logging for every call.

Week 8: AWS Lambda + Serverless Automation

Class 3: Lambda — Event Processing + Lambda layers

Lambda execution model — cold starts, warm starts, concurrency limits, what they cost
Function anatomy — handler, event object, context object
IAM roles and least-privilege security for Lambda
Triggers — S3, SQS, SNS, EventBridge cron and event-based
Environment variables and secrets management in Lambda
Lambda deployment — zip packaging, console vs CLI
CloudWatch Logs integration and structured logging from Lambda
Lambda Layers — packaging dependencies, sharing code across functions
Integration with SQS, SNS, and S3 for event processing
EventBridge for event routing and processing
Error handling between Lambda stages — retries, DLQ, alerting
Lambda concurrency management for high-throughput pipelines
Monitoring pipeline health with CloudWatch metrics and alarms
Dead letter queues for failed invocation handling

Project 1: IAM Key Rotation Lambda. A Lambda function on an EventBridge schedule that scans all IAM users, identifies keys older than 90 days, rotates them, stores new keys in Secrets Manager, and sends an SES email report with key ages and rotation status.
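
The "identify keys older than 90 days" step is plain date arithmetic and can be tested without AWS. In this sketch the input dicts mirror the metadata fields returned when listing IAM access keys (`UserName`, `AccessKeyId`, `Status`, `CreateDate`); the function name `keys_due_for_rotation` and the choice to skip inactive keys are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def keys_due_for_rotation(access_keys, max_age_days=90, now=None):
    """Return active keys older than `max_age_days`, oldest first."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    due = []
    for key in access_keys:
        # Assumption: inactive keys are reported elsewhere, not rotated.
        if key["Status"] == "Active" and key["CreateDate"] < cutoff:
            due.append({
                "UserName": key["UserName"],
                "AccessKeyId": key["AccessKeyId"],
                "AgeDays": (now - key["CreateDate"]).days,
            })
    return sorted(due, key=lambda k: k["AgeDays"], reverse=True)
```

Accepting `now` as a parameter keeps the function deterministic in unit tests; the Lambda handler would call it with the default and then perform the rotation, the Secrets Manager write, and the SES report.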

Project 2: Daily Cloud Cost Report Lambda. An EventBridge-triggered Lambda that pulls Cost Explorer data via boto3, formats a per-service cost breakdown, compares against last week, and emails the report via SES every morning.

Class 4: ClamAV File Scanning Automation for S3 Security

subprocess for running system tools from Python — the right way
ClamAV setup, freshclam for virus DB updates, scan result parsing via return codes
S3 event notification → SQS → Python consumer pattern end to end
Downloading files from landing bucket, scanning locally, routing based on result
S3 object tagging — Clean/Infected with put_object_tagging
Multi-account AWS architecture — landing account vs clean account
SES email alerts for infected files with full metadata
Production error handling — ClamAV crash, S3 download failure, malformed SQS message

Project: Banking Compliance File Scanner. S3 upload triggers SQS message → Python consumer downloads file → ClamAV scans → tags object Clean or Infected → routes clean files to processing bucket → blocks infected files → SES alert to security team with filename, bucket, timestamp, and full scan output. Full logging, retry logic, and dead letter queue for failed scans.
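
The scan step of this pipeline hinges on reading `clamscan` exit codes correctly: 0 means no virus was found, 1 means a virus was found, and 2 means the scan itself failed. A minimal sketch of that mapping follows; `classify_scan`, `scan_file`, and `ScanError` are hypothetical names, and the real consumer would add the S3 download, tagging, and alerting around it.

```python
import subprocess

# clamscan exit codes: 0 = no virus found, 1 = virus found, 2 = error
SCAN_TAGS = {0: "Clean", 1: "Infected"}


class ScanError(RuntimeError):
    """Raised when the scanner itself failed, so the file is neither tag."""


def classify_scan(returncode, output=""):
    """Map a clamscan exit code to the S3 tag value, or raise ScanError."""
    try:
        return SCAN_TAGS[returncode]
    except KeyError:
        raise ScanError(
            f"clamscan failed with exit code {returncode}: {output.strip()}"
        ) from None


def scan_file(path, run=subprocess.run):
    """Scan one local file; `run` is injectable so tests need no ClamAV."""
    result = run(
        ["clamscan", "--no-summary", path],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code is a result here, not an exception
    )
    return classify_scan(result.returncode, result.stdout + result.stderr)
```

Treating a scanner failure as its own error, rather than as Clean, is the important design choice: a failed scan should send the message to retry or the dead letter queue, never to the processing bucket.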

Week 9: Lambda Advanced + FinOps Automation + RDS Migration

Class 5: FinOps Automation + Cost Optimization Scripts

API Gateway + Lambda integration — proxy vs non-proxy, request/response mapping
Multi-Lambda orchestration patterns — chaining, fan-out, fan-in
Lambda cost optimization: right-sizing memory, reducing cold starts
RDS cost breakdown — instance type, storage, IOPS, Multi-AZ, snapshots
boto3 for pulling RDS and EC2 metrics via CloudWatch — CPU, connections, storage utilization
Identifying rightsizing candidates — underutilized instances, oversized storage
EC2 cost analysis — finding idle instances, stopped instances still costing money
Building a FinOps report that runs weekly and emails recommendations

Project 1: Multilevel Image Processing Pipeline. Client uploads image → S3 trigger → Lambda 1 validates format and size → Lambda 2 transforms (resize, watermark, convert format) → stores to clean bucket → Lambda 3 sends SES notification with processed image link. Full error handling, DLQ for failed images, CloudWatch dashboard showing pipeline health.

Project 2: RDS usage report generation for FinOps analysis of cost-saving opportunities.

Class 6: RDS Migration Automation

Migration planning — pre-migration health check script, validating source DB before starting
pg_dump and pg_restore from Python subprocess — full and schema-only dumps
pgsync for live data migration with minimal downtime
Data validation post-migration — row counts, checksum comparison, schema diff
Migration rollback strategy — when to cut back, how to keep the source alive
Post-migration validation script — automated integrity checks with SES report
Containerizing the migration script with Docker — Dockerfile, entrypoint, env vars
Deploying a migration job on ECS Fargate as a one-off task, not a long-running service
Lambda trigger for the ECS task — one-click or scheduled migration kickoff

Project: End-to-End RDS Migration Platform. A Python migration script containerized with Docker and deployed to ECS Fargate as a one-off task. Lambda triggers the migration on demand, pgsync runs inside the container with source and target RDS connections, a post-migration validation Lambda runs row count and checksum checks, and SES sends status updates at each stage. Production-ready and reusable for any PostgreSQL migration on AWS.
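
The row count and checksum checks in the validation stage can be illustrated on in-memory rows. This is a sketch only: `table_checksum` and `validate_table` are hypothetical helpers, rows are assumed to arrive as tuples from a database cursor, and for large tables the checksum would more realistically be computed inside the database.

```python
import hashlib


def table_checksum(rows):
    """Order-independent SHA-256 checksum over an iterable of row tuples."""
    digest = hashlib.sha256()
    # Sorting the serialised rows makes the result independent of fetch order.
    for line in sorted(repr(tuple(row)) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def validate_table(source_rows, target_rows):
    """Compare one table between source and target after a migration."""
    source_rows, target_rows = list(source_rows), list(target_rows)
    return {
        "source_rows": len(source_rows),
        "target_rows": len(target_rows),
        "row_count_match": len(source_rows) == len(target_rows),
        "checksum_match": table_checksum(source_rows) == table_checksum(target_rows),
    }
```

Row counts catch missing data cheaply, while the checksum catches rows that arrived but were altered, which is why the two checks are reported separately.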

Module 4: (6-week) Kubernetes on AWS

Week 10: Kubernetes Fundamentals + Local Clusters

Kubernetes Architecture + Core Concepts

The why behind Kubernetes — what broke before it existed
Control plane deep dive — API server, etcd, scheduler, controller manager
Worker node components — kubelet, kube-proxy, container runtime
Core objects — Pod, ReplicaSet, Deployment, Service
Setting up Minikube locally, kubectl basics and everyday commands
YAML manifests in depth — apiVersion, kind, metadata, spec
ConfigMaps and Secrets — creating, mounting as env vars and volumes
Namespaces and resource organisation
Labels, selectors, annotations — how Kubernetes finds things
Resource requests and limits — why they matter in production
Kubernetes DNS and service discovery internals
ImagePullSecrets for private registries
Lens (Freelens) — Kubernetes IDE for visual cluster management

Project: Deploy a 2-tier e-commerce app (frontend + PostgreSQL) on Minikube — wired together with Services, ConfigMaps, Secrets, private image registry

Resilience Patterns, Autoscaling + Live Debugging

Liveness, readiness, and startup probes — real failure scenarios
Rolling upgrades and rollback strategies
HPA and VPA — pod autoscaling based on CPU/memory/custom metrics
Init containers and sidecar patterns
Pod Disruption Budgets for zero-downtime deployments
Deployment strategies — Recreate vs RollingUpdate vs Blue-Green
CrashLoopBackOff, OOMKilled — live debugging techniques
Resource quotas and LimitRanges per namespace
Reading Kubernetes events to diagnose failures fast
StatefulSets intro — ordered deployment, stable network identity
DaemonSets and Jobs — when to use each

Project: Add HPA, probes, and PodDisruptionBudget to the e-commerce app. Simulate CrashLoopBackOff and OOMKilled failures live and debug them. Add a PostgreSQL StatefulSet with persistent local storage
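
When watching the HPA react in this project, it helps to know the scaling rule it applies: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a small tolerance of 1. The sketch below reproduces that arithmetic; the function name, the 0.1 tolerance default, and the min/max bounds are illustrative assumptions, and the real controller adds stabilisation windows and handling for pods that are not ready.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target
    print(desired_replicas(4, 90, 60))
```

Because the metric is usually a percentage of the container's CPU request, an HPA on a pod without resource requests has nothing to compute against, which ties this back to the requests and limits topic from the previous class.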

Week 11: CI/CD with OIDC, GitOps + Production EKS Foundation

GitOps with ArgoCD + CI/CD Pipeline on Minikube

GitOps fundamentals — why GitOps over push-based deployments
ArgoCD setup on Minikube — apps, sync policies, health checks
End-to-end CI/CD pipeline — GitHub Actions builds image, ArgoCD deploys
ArgoCD app-of-apps pattern intro
Branching strategy for GitOps — app repo vs config repo separation
Rollback with ArgoCD — one-click vs automated
Basic Prometheus + Grafana on Minikube — request rate, pod health dashboards
Debugging failed ArgoCD syncs — common causes and fixes
Multi-environment GitOps intro — dev vs prod namespaces on same cluster

Project: GitHub Actions pipeline builds and pushes e-commerce image on every commit, ArgoCD auto-deploys to Minikube, basic Grafana dashboard showing pod health and request rate, rollback demonstrated live

Production EKS Setup + Networking + Security Foundations

EKS cluster setup via eksctl and AWS console
EKS add-ons — VPC CNI, CoreDNS, EBS CSI Driver, kube-proxy
IRSA — Kubernetes to AWS IAM with OIDC, no hardcoded credentials
AWS Load Balancer Controller with Helm — architecture and annotations
Ingress for internal and external traffic routing
ExternalDNS for automatic Route53 record management
Domain, SSL/TLS termination with ACM
EKS managed node groups vs self-managed nodes — when to use which
EKS access entry for cluster access

Project: EKS cluster up with eksctl, AWS Load Balancer Controller and ExternalDNS deployed via Helm, RBAC hardened, custom domain with SSL termination working

Week 12: 3-Tier App on EKS + Docker image optimization

Running 3-Tier App on EKS + AWS Integrations

Running 3-tier app — frontend + backend + RDS PostgreSQL on EKS
Database migrations using Kubernetes Jobs
Init containers for DB connection readiness checks
IRSA in practice — backend pod accessing Secrets Manager without credentials
AWS Secrets Manager integration — External Secrets Operator pattern
Ingress rules for routing traffic to frontend vs backend
Health checks at load balancer level vs pod level
Blue-Green deployment on EKS with weighted routing
Namespace strategy for multi-tier apps
Real troubleshooting — ImagePullBackOff, pending pods, service not reachable

Project: Full 3-tier app on EKS — frontend + Node.js backend + RDS PostgreSQL, IRSA for Secrets Manager, DB migration Job, custom domain, SSL, live troubleshooting of staged failures

StatefulSets, Persistent Storage + Docker Image Optimisation

StatefulSets deep dive — production patterns and failure recovery
PersistentVolume, PVC, StorageClass — static vs dynamic provisioning on EKS
EBS vs EFS — choosing the right storage for the workload
Headless Services for StatefulSet DNS resolution
Troubleshooting multi-attach volume errors and common StatefulSet failures
Volume snapshots and backup strategies on EKS
Multi-stage Docker builds — drastically smaller production images
Distroless and minimal base images for attack surface reduction
Docker image optimisation — layer caching, build context, .dockerignore

Project: Add MinIO as a StatefulSet with persistent EBS storage to the e-commerce app for product image uploads. Rebuild all images with multi-stage builds, integrate Trivy in GitHub Actions, and reduce image sizes by 60%+

Week 13: Microservices on K8s + Terraform EKS + GitOps

Production EKS with Terraform

Production EKS cluster with Terraform — VPC, subnets, node groups, and add-ons
Terraform module structure for EKS — separation of concerns
Managing dev/staging/prod with Terraform workspaces
Deploying AWS Load Balancer Controller and ExternalDNS via Terraform
IRSA setup via Terraform — no manual console steps
Terraform drift detection on EKS infrastructure
Node group configuration — instance types, spot vs on-demand, taints and tolerations
EKS upgrade strategy with Terraform — node group rotation

Project: Rebuild the entire EKS cluster from scratch with Terraform — VPC, node groups, add-ons, IRSA, Load Balancer Controller, and ExternalDNS all provisioned via code. Zero manual console steps

Microservices on EKS + Advanced ArgoCD GitOps

Microservices design principles — bounded context, single responsibility
Splitting monolith into microservices — frontend, order, inventory, and user service
Inter-service communication — ClusterIP vs headless vs service mesh
Network Policies for microservice traffic isolation between namespaces
Gateway API — advanced ingress routing vs traditional Ingress
ArgoCD App-of-Apps pattern — managing many services cleanly
ArgoCD ApplicationSet for environment promotion across dev/staging/prod
Matrix builds in GitHub Actions for multiple microservices

Project: E-commerce app split into 4 microservices, each with its own Helm chart and ArgoCD Application, App-of-Apps managing all services, Gateway API routing, matrix CI/CD builds, OpenCost showing per-service spend

Week 14: Full Observability Stack

Metrics, Logs + Dashboards

How observability works in real production companies
Prometheus — metrics collection, PromQL, scrape configs
Prometheus Operator and ServiceMonitor CRDs
Loki for log storage and querying — LogQL basics
Fluent Bit on EKS — log aggregation, filtering, routing to Loki
Grafana dashboards — Kubernetes cluster, app metrics, AWS resource metrics
AlertManager — routing alerts to Slack and PagerDuty with grouping and silencing
CloudWatch Container Insights integration alongside Prometheus
Monitoring differences — Fargate vs managed node groups
Cost visibility dashboard — RDS, Lambda, EKS node costs in Grafana

Project: Prometheus + Loki + Grafana + Fluent Bit deployed on EKS, Grafana dashboard showing order volume, error rates, DB query latency, AlertManager fires Slack alert when order service error rate crosses 1%, CloudWatch Container Insights alongside

Distributed Tracing, SLOs + Advanced Alerting

OpenTelemetry for distributed tracing — instrumentation, collectors, exporters
Tracing a request across frontend → order service → inventory service → DB
Jaeger or Tempo as tracing backend — setup and querying
SLO and SLI definitions — what they mean in practice
Error budget dashboards in Grafana — how teams use them day to day
Multi-window, multi-burn-rate alerting for SLOs
AlertManager advanced — inhibition rules, routing trees, deduplication
Runbook links in alerts — connecting alert to action
Log-based alerting in Grafana with Loki rules
Observability for stateful services — what’s different about monitoring databases
Live debugging 10 real Kubernetes interview scenarios — staged failures on the e-commerce cluster
Advanced live troubleshooting — node pressure, evictions, DNS failures, RBAC misconfiguration

Project: OpenTelemetry tracing across all e-commerce microservices, Tempo as backend, Grafana trace explorer showing end-to-end request flow, SLO dashboards with error budget burn rate, multi-window AlertManager rules for order service
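
The multi-window burn-rate rule in this project boils down to one condition: page only when both a long window and a short window are burning the error budget faster than a threshold. The sketch below assumes a 99.9% SLO and a threshold of 14.4 (a commonly cited pairing for a 1-hour and 5-minute window, equal to spending 2% of a 30-day budget in one hour); the function names and defaults are illustrative, not the course's exact rules.

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being spent (1.0 = exactly on budget)."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target=0.999, threshold=14.4):
    """Fire only if both windows exceed the burn-rate threshold."""
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

The long window shows that the problem is significant, and the short window shows that it is still happening, which stops the alert from firing long after an incident has recovered.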

Week 15: Service Mesh, Karpenter + Security

Service Mesh + Network Policies + Zero Trust

Service mesh fundamentals — why it exists, what problems it actually solves
Istio installation and architecture — control plane, data plane, sidecars
mTLS between all microservices — automatic, no code changes
Traffic management — VirtualService, DestinationRule, Gateway
Canary deployments with Istio traffic splitting — 10% to new version
Visualising service mesh traffic with Kiali
Network Policies for zero-trust pod-to-pod communication
Egress controls and namespace isolation
Pod topology spread constraints for multi-AZ resilience
Istio observability — built-in metrics, tracing integration with Jaeger

Project: Istio deployed on e-commerce EKS cluster, mTLS enforced between all microservices, canary deployment for order service routing 10% traffic to v2, Kiali showing live traffic topology, network policies blocking all non-essential pod communication

Karpenter, EKS Auto Mode + Cost Optimisation

Karpenter architecture — how it differs from Cluster Autoscaler
NodePool and EC2NodeClass configuration
Cost optimisation with Spot + On-Demand mixed node fleets
Karpenter bin packing and consolidation policies — removing underutilised nodes
Taints, tolerations, and node selectors with Karpenter
EKS Auto Mode — what it is, when to use it over Karpenter
Pod topology spread constraints across AZs with Karpenter
Kyverno policy enforcement — blocking deployments without resource limits
Pod security admission — restricted, baseline, privileged modes
Security contexts and pod security standards in practice

Project: Karpenter deployed on EKS, replacing managed node group for inventory service, Spot instances with on-demand fallback, Kyverno policies enforced, blocking any deployment without resource limits and liveness probes, pod security standards applied cluster-wide

Module 5: (3-week) DevSecOps and SRE

Week 16: Shift Left Security + Pipeline Hardening

DevSecOps on Kubernetes — the shift left mindset, where security fits in the SDLC, and why bolt-on security after deployment is a losing strategy
Security integrated into the pipeline — SAST with Semgrep for code-level vulnerabilities, DAST with OWASP ZAP against staging, SCA and dependency auditing with Trivy and Grype
Pre-commit secret scanning and GitHub secret scanning — catching credentials before they hit the repository, not after
Container supply chain security — Trivy blocking critical CVEs before image push, image signing with Cosign, SBOM generation with Syft so you know exactly what is in every image
IaC security scanning — Checkov linting Terraform and Kubernetes manifests before apply, failing the pipeline on misconfigurations not just vulnerabilities
Secrets management in the pipeline — no hardcoded credentials anywhere, OIDC for AWS auth, External Secrets Operator pulling from Secrets Manager into pods at runtime

Week 17: Runtime Security + Policy as Code + Zero Trust on EKS

Falco for runtime threat detection on EKS — custom rules for suspicious process execution, unexpected file access, and anomalous network activity, alerting to Slack in real time
Kyverno policy engine — blocking unsigned images, requiring resource limits, enforcing labels and security contexts, validating ingress rules cluster-wide
Pod Security Admission — restricted, baseline, and privileged modes, applying the right profile per namespace and understanding what each one actually prevents
CIS Kubernetes Benchmark scanning with kube-bench — running the benchmark against the EKS cluster, understanding each failed check, systematic remediation
Network policies for zero-trust between namespaces — default deny all, explicit allow only, verifying isolation holds under live traffic

Project — Complete DevSecOps Pipeline

GitHub Actions enforces the full chain before anything reaches EKS: Semgrep scans application code, Checkov lints manifests, Trivy blocks images with critical CVEs, Cosign signs every image that passes. On the cluster side: Kyverno rejects any unsigned image or manifest missing resource limits, Falco monitors runtime behaviour and fires Slack alerts on suspicious activity, network policies enforce zero-trust between namespaces, kube-bench runs on a schedule and posts a remediation report. Every stage has a hard fail — nothing progresses if the gate does not pass.

Week 18: SRE 101 + everything that comes with it

While I cover parts of SRE in the Kubernetes, MLOps, and AIOps sections, this week focuses on core SRE.

SLI, SLO, SLA — precise definitions, writing measurable objectives that mean something, and how error budgets make deployment decisions instead of gut feel
DORA metrics — deployment frequency, lead time for changes, MTTR, and change failure rate as a team health signal, and how to actually measure them across the bootcamp stack
On-call culture and runbook discipline — what a good runbook looks like, escalation path design, handoff practice, and why most runbooks fail at 2 am
Postmortem culture — blameless analysis, timeline reconstruction, contributing factors, and writing RCAs that prevent recurrence rather than assign blame
Error budget policy — what happens when the budget is burned, how to freeze deployments, negotiate reliability work vs feature work
Kubernetes interview scenarios — the 10 most common live debugging questions asked at the senior level, and how to think through them systematically under pressure
Chaos engineering with LitmusChaos — pod failure, network delay, CPU stress, and disk fill experiments with defined steady-state hypotheses and explicit blast radius limits
Three live war room simulations drawn from across the bootcamp stack, followed by a written RCA
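
The error budget idea from the list above comes down to simple arithmetic, sketched below. The class name, the request-based SLI, and the freeze threshold of 100% budget consumption are assumptions for illustration; a real error budget policy would set its own thresholds and time windows.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float      # e.g. 0.999 means "99.9% of requests succeed"
    total_requests: int    # requests seen in the SLO window
    failed_requests: int   # requests that violated the SLI in the same window

    @property
    def allowed_failures(self) -> float:
        """How many failures the SLO tolerates in this window."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_consumed(self) -> float:
        """Fraction of the error budget already burned (1.0 means fully spent)."""
        if self.allowed_failures == 0:
            return float("inf") if self.failed_requests else 0.0
        return self.failed_requests / self.allowed_failures

    def deploys_allowed(self, freeze_threshold: float = 1.0) -> bool:
        """Hypothetical policy: freeze deployments once the threshold is reached."""
        return self.budget_consumed < freeze_threshold


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"budget consumed:  {budget.budget_consumed:.0%}")
    print(f"deploys allowed:  {budget.deploys_allowed()}")
```

With a 99.9% SLO over one million requests, roughly 1,000 failures are tolerated, so 250 failures burn about a quarter of the budget and deployments continue.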

Module 6: (4-week) MLOps

Week 19: ML Foundations for Platform Engineers

How a model goes from a data scientist’s laptop to production
Where platform engineers own the problem vs data scientists
MLflow for experiment tracking and model registry — hands-on
DVC for data versioning — treating datasets like code
Project: Track 3 training runs in MLflow, version the dataset with DVC, promote a model from Staging to Production in the registry
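
"Treating datasets like code" means Git tracks a small pointer file while the data itself lives elsewhere. The sketch below illustrates that idea with the standard library only. It is not DVC's implementation or file format; the function names and the `.pointer.json` suffix are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large datasets do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small JSON pointer next to the dataset; this is what Git would track."""
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer = {
        "path": data_path.name,
        "sha256": file_digest(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


def has_changed(data_path: Path, pointer_path: Path) -> bool:
    """True when the dataset on disk no longer matches the recorded hash."""
    recorded = json.loads(pointer_path.read_text())
    return recorded["sha256"] != file_digest(data_path)
```

Because the pointer changes whenever the data changes, every training run can be tied to an exact dataset version through an ordinary Git commit.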

Week 20: Building the Training Platform

Kubeflow Pipelines on EKS — DAG workflows, parameterized runs, cached steps
Argo Workflows as an alternative — when to prefer it over Kubeflow
GPU node setup on EKS — NVIDIA device plugin, taints, tolerations
Spot instances for training via Karpenter with on-demand fallback
Project: Build a full training pipeline on Kubeflow — data validation → training → evaluation → model registry promotion. Runs on Spot, falls back to on-demand automatically
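
The project pipeline above is a DAG with cached steps. Here is a minimal sketch of both ideas using Python's built-in `graphlib` (Python 3.9+). It is not Kubeflow code: the step functions are stand-ins that return made-up values, and the cache key scheme is an assumption chosen for simplicity.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
PIPELINE = {
    "data_validation": set(),
    "training": {"data_validation"},
    "evaluation": {"training"},
    "registry_promotion": {"evaluation"},
}

# Stand-in step functions; each receives the outputs of its upstream steps.
STEPS = {
    "data_validation": lambda upstream: {"rows": 1000},
    "training": lambda upstream: {"model": "v1", "rows": upstream["data_validation"]["rows"]},
    "evaluation": lambda upstream: {"accuracy": 0.9},
    "registry_promotion": lambda upstream: upstream["evaluation"]["accuracy"] >= 0.8,
}


def run_pipeline(dag, steps, cache):
    """Run steps in dependency order, skipping any step whose inputs were seen before."""
    results = {}
    executed = []
    for name in TopologicalSorter(dag).static_order():
        inputs = {dep: results[dep] for dep in sorted(dag[name])}
        key = (name, repr(inputs))  # same step + same inputs means a cache hit
        if key not in cache:
            cache[key] = steps[name](inputs)
            executed.append(name)
        results[name] = cache[key]
    return results, executed


if __name__ == "__main__":
    cache = {}
    _, first = run_pipeline(PIPELINE, STEPS, cache)
    _, second = run_pipeline(PIPELINE, STEPS, cache)
    print("first run executed: ", first)
    print("second run executed:", second)
```

The second run executes nothing because every step sees the same inputs as before, which is the behaviour cached steps give you when only a downstream parameter changes.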

Week 21: Model Serving

Serving patterns — batch, real-time REST, async queue, streaming — when each makes sense
BentoML for packaging and deploying models on EKS
SageMaker real-time endpoints vs self-hosted EKS — honest trade-off discussion
KEDA autoscaling on SQS queue depth for async inference
Project: Deploy the same model two ways — BentoML on EKS and SageMaker endpoint. Compare latency, cost, and operational overhead side by side
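
For the KEDA item above, the scaling decision is roughly "queue depth divided by the number of messages one replica should handle, rounded up, then clamped". The sketch below shows that arithmetic only. The function and parameter names are hypothetical, and real autoscaling also involves polling intervals, cooldown periods, and stabilisation windows that are not modelled here.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Approximate replica count for queue-based scaling."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    # Clamp to the configured bounds so a traffic spike cannot scale without limit.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for depth in (0, 12, 1000):
        print(depth, "messages ->", desired_replicas(depth, target_per_replica=5), "replicas")
```

With a minimum of zero, an empty queue scales the inference workers down to nothing, which is the main cost argument for async serving.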

Week 22: ML CI/CD + Observability + Drift Detection

Champion vs challenger pattern — automated comparison, promote only if challenger wins on accuracy and latency
ML governance — audit trail, who promoted what model, trained on what data
Canary model rollout with Istio — 10% to new model, promote on metric threshold
Evidently AI as a sidecar for data quality and drift scoring
Prometheus custom metrics from inference — confidence scores, prediction volume, latency histograms
Cost visibility with OpenCost — per-model attribution, right-sizing serving replicas
Project: Full ML CI/CD pipeline — code push triggers training, challenger evaluated against champion, canary rollout via Istio, drift monitoring live with automated retraining
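
The champion vs challenger rule ("promote only if challenger wins on accuracy and latency") can be written as a small gate function. This is a sketch under assumptions: the metric names, the use of p95 latency, and the default tolerances of zero are all illustrative choices, not the project's actual thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelMetrics:
    name: str
    accuracy: float          # higher is better
    p95_latency_ms: float    # lower is better


def should_promote(champion: ModelMetrics, challenger: ModelMetrics,
                   min_accuracy_gain: float = 0.0,
                   max_latency_regression_ms: float = 0.0) -> tuple:
    """Return (decision, reasons) so the outcome can be stored in an audit trail."""
    reasons = []
    accuracy_gain = challenger.accuracy - champion.accuracy
    latency_delta = challenger.p95_latency_ms - champion.p95_latency_ms
    if accuracy_gain <= min_accuracy_gain:
        reasons.append(f"accuracy gain {accuracy_gain:+.4f} is not above {min_accuracy_gain}")
    if latency_delta > max_latency_regression_ms:
        reasons.append(f"latency change {latency_delta:+.1f} ms exceeds {max_latency_regression_ms} ms")
    return (not reasons, reasons)


if __name__ == "__main__":
    champion = ModelMetrics("champion", accuracy=0.91, p95_latency_ms=40.0)
    challenger = ModelMetrics("challenger", accuracy=0.93, p95_latency_ms=38.0)
    print(should_promote(champion, challenger))
```

Returning the reasons alongside the decision keeps the gate explainable: the CI job can log exactly why a challenger was rejected.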

Module 7: (3-week) AIOps

Week 23: LLM Infrastructure on AWS + Foundations of AIOps

AIOps from a platform engineer’s perspective — what it is, what it is not, and where LLMs genuinely help operations vs where they add noise
LLM deployment options on AWS — Bedrock for managed models (Claude, Llama, Titan), SageMaker JumpStart
Prompt engineering for operations — structured outputs, chain-of-thought for incident reasoning, avoiding hallucinated commands in production contexts
LLM-driven cost optimisation — analysing Cost Explorer output, generating rightsizing recommendations, flagging idle resources with supporting evidence
Running a local LLM for development — testing prompts offline before hitting Bedrock, cost discipline from day one
Building custom MCP servers — creating cost reports, provisioning, and deleting AWS resources from natural language prompts
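
One practical way to get structured outputs and avoid hallucinated commands is to treat the model's reply as untrusted input: parse it as JSON, check the fields, and reject any action that is not on an allowlist. The sketch below assumes a made-up response schema and action list. It does not call Bedrock or any LLM, it only validates a reply that has already been received.

```python
import json

# Hypothetical allowlist and schema, for illustration only.
ALLOWED_ACTIONS = {"describe_pod", "get_logs", "restart_deployment"}
SEVERITIES = {"low", "medium", "high", "critical"}
REQUIRED_FIELDS = ("severity", "probable_cause", "action", "target")


def parse_llm_response(raw: str) -> dict:
    """Validate a model reply; raise ValueError instead of trusting bad output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    for field_name in REQUIRED_FIELDS:
        if not isinstance(data.get(field_name), str):
            raise ValueError(f"missing or non-string field: {field_name}")
    if data["severity"] not in SEVERITIES:
        raise ValueError(f"unknown severity: {data['severity']}")
    if data["action"] not in ALLOWED_ACTIONS:
        # The model proposed something outside the allowlist; never execute it.
        raise ValueError(f"action not allowed: {data['action']}")
    return data


if __name__ == "__main__":
    reply = ('{"severity": "high", "probable_cause": "OOMKilled", '
             '"action": "get_logs", "target": "payments/api"}')
    print(parse_llm_response(reply))
```

The important design choice is that the model picks from named actions rather than writing shell commands, so an invented command has nowhere to go.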

Week 24: RAG for Operations

Why a general-purpose LLM knows nothing about your infrastructure — the case for RAG over fine-tuning in operations contexts
Embedding models and vector databases — pgvector on RDS and OpenSearch vector engine, understanding chunking strategies for technical documents
Building a RAG system that indexes all runbooks, postmortems, and architecture decision records from the project — the operational knowledge base
Continuous re-ingestion triggered by a Git webhook when runbooks are updated — keeping the knowledge base current automatically
Evaluating RAG quality — retrieval precision, answer relevance, and knowing when your system is confidently wrong
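
Retrieval precision from the last item can be measured with precision@k: of the top k chunks the retriever returned, how many were actually relevant. A minimal sketch follows. The document IDs are invented, and the hand-labelled set of relevant IDs is an assumption, since building that set is the real work in practice.

```python
def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def mean_precision_at_k(runs: list, k: int) -> float:
    """Average precision@k over several (retrieved, relevant) query results."""
    if not runs:
        raise ValueError("need at least one query result")
    return sum(precision_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical retrieval result for one operational question.
    retrieved = ["runbook-dns", "adr-12", "postmortem-7", "runbook-disk"]
    relevant = {"runbook-dns", "postmortem-7"}
    print(f"precision@3 = {precision_at_k(retrieved, relevant, 3):.2f}")
```

Dividing by k even when fewer than k chunks come back penalises a retriever that returns too little, which is usually what you want when the answer depends on finding the right runbook.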

Week 25: Agentic AIOps + LLM-Powered Self-Healing Infrastructure

Agentic AI architecture for operations — ReAct pattern, tool calling, and decision loops with constrained action spaces
LLM-powered incident triage — AlertManager webhook triggers a Lambda agent that queries Loki for recent logs, queries Prometheus for metrics, retrieves relevant runbook sections from the RAG system, and posts a structured Slack message with severity, probable cause, and first three recommended actions in under 60 seconds
Tool calling design for operations — what tools the agent can call, what it cannot, and why the action space must be explicitly bounded
Automated remediation with guardrails — blast radius limits, rollback triggers, maximum actions per incident, hard stops the agent cannot override
Dry-run mode — agent proposes actions, human approves, then executes — building trust before going fully autonomous
Self-healing infrastructure patterns — detect, decide, act, and verify — the full loop and where each stage can go wrong
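
The guardrail ideas above (bounded action space, maximum actions per incident, dry-run with human approval) fit in a small wrapper that sits between the agent and anything that can change infrastructure. This is a sketch only: the class and action names are hypothetical, and the "executed" branch is a stub where a real system would call the actual tool.

```python
from dataclasses import dataclass, field


@dataclass
class GuardedExecutor:
    allowed_actions: frozenset
    max_actions_per_incident: int = 3
    dry_run: bool = True                      # propose only, unless a human approves
    log: list = field(default_factory=list)   # every decision is recorded for the RCA
    _executed: int = 0

    def submit(self, action: str, target: str, approved: bool = False) -> str:
        """Decide what happens to an action the agent wants to take."""
        if action not in self.allowed_actions:
            status = "rejected: action not in allowlist"
        elif self._executed >= self.max_actions_per_incident:
            status = "rejected: action limit reached"   # hard stop the agent cannot override
        elif self.dry_run and not approved:
            status = "proposed: awaiting human approval"
        else:
            self._executed += 1
            status = "executed"                          # stub: the real tool call goes here
        self.log.append((action, target, status))
        return status


if __name__ == "__main__":
    executor = GuardedExecutor(frozenset({"restart_deployment"}), max_actions_per_incident=1)
    print(executor.submit("delete_namespace", "prod"))
    print(executor.submit("restart_deployment", "payments/api"))
    print(executor.submit("restart_deployment", "payments/api", approved=True))
    print(executor.submit("restart_deployment", "payments/api", approved=True))
```

The checks run in a fixed order, so approval can never unlock an action that is off the allowlist or over the limit.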

Bonus: Getting ready for the job market

Resume + Portfolio + Interview Preparation

How senior engineers talk about reliability in interviews — cost vs reliability vs complexity framing, presenting war room experience as production credibility

Recorded mock interviews — full recorded mock interview sessions covering DevOps, Kubernetes, MLOps, and system design questions, shared with all students.

Group sessions — live group interview practice sessions where students interview each other with feedback from Akhilesh.

Resume framework — action verbs, metric-driven project impact statements, and how to present bootcamp work as production experience to a hiring manager.

LinkedIn and resume review by Akhilesh — personal feedback, not automated.

Referrals to opportunities in the LivingDevOps network where there is a genuine fit.

GitHub portfolio cleanup — README structure, architecture diagrams, ADRs, and decision logs that show engineering judgment, not just code.

What will you get out of this Advanced DevOps, MLOps & AIOps Bootcamp on AWS

25 weeks of live instruction · 50 classes, 3 hours each, plus 1 bonus week of interview prep
Pre-bootcamp Linux + AWS recorded sessions sent on enrolment
You will build 20+ real-world projects, debug live production failures, and walk away with a GitHub portfolio that shows engineering judgment — not just tutorial code.
Lifetime access to all recordings, code, notes, and resources
LivingDevOps Discord — cohort community + alumni network
Referrals to relevant opportunities in the network where there is a genuine fit
Certificate of completion — DevOps + MLOps + AIOps

This curriculum follows a logical, incremental learning path from Linux fundamentals to advanced Kubernetes projects, ensuring each concept builds upon previous knowledge

Reach out for Queries

Email: livingdevops@gmail.com
WhatsApp: +91 9259681620

Testimonials

Fahim Vazir

The bootcamp has helped me understand and apply DevOps concepts and workflows from the base up to advanced. What stood out was that it involved how all of the concepts are applied in production environments for real-world use-cases. It has been beneficial by a great deal really. Piyush (Azure bootcamp instructor from Living Devops) has been a great instructor he has been very helpful, patient and has explained everything in a nice manner. Highly recommended if anyone wants to opt for a DevOps bootcamp!

Abu

I recently completed Akhilesh's 6-month DevOps Bootcamp, and it has been one of the most valuable learning experiences of my career. The bootcamp was highly structured, practical, and focused on real-world industry scenarios rather than just theory. From Linux, AWS, Docker, Terraform, Kubernetes, and CI/CD to advanced DevOps practices, every topic was explained with hands-on projects and live troubleshooting sessions. Akhilesh's mentorship, guidance, and industry insights helped me build both technical skills and confidence. I highly recommend this bootcamp to anyone serious about building a strong career in DevOps.

Balmiki Badatya

Livingdevops bootcamp covers everything Linux, Docker, EKS, Terraform, Python, and it's all structured so well that nothing feels rushed. The best part is how practical it is every concept comes with real examples that actually make sense instead of just slides and definitions.

Sandeep

I have recently completed the AWS DevOps Bootcamp, and it has been a great experience for my career. The program is well-structured, starting from foundational cloud concepts and progressing into advanced DevOps practices like CI/CD pipelines, infrastructure as code, and containerization. Throughout the bootcamp Akhilesh was very supportive and always ready to clarify doubts and provide industry insights. I would highly recommend this bootcamp to anyone looking to build or advance their career in cloud and DevOps.

Hemant kumar

"I recently completed the DevOps bootcamp from Akhilesh, and it was a game-changer. His ability to break down complex concepts into actionable steps made the material incredibly easy to digest. I walked away not just with new knowledge, but with the confidence to apply it immediately to my role."

Ameet Khemani

Akhilesh, you are one of the best mentor in today's time. I really learned new things with clear cut understanding and that also in reference with real world examples. I would definitely recommend this course to anyone who want to do better in DevOps.

Avinash V

As a senior Devops professional with 12+ years with mostly in the legacy enterprise environments, Akhilesh’s Devops bootcamp was the ideal bridge to cloud-native mastery with production focused training and projects, live troubleshooting etc. I would highly recommend his bootcamps to anyone who are serious to learn and excel in the Devops field.

Varsha Gore

Akhilesh has provided structured DevOps course details right from the beginning. I could see the detail oriented approach and his sincerity throughout those sessions. He was able to show what to expect and how to troubleshoot. The additional resources were also very helpful.