Advanced Course · Part of AI Architect

AI Architect: High-Performance Inference & Smart Routing

Maximize throughput and minimize latency across your model serving stack. Master vLLM, TensorRT-LLM, speculative decoding, and intelligent request routing — then deploy the full system on Kubernetes with auto-scaling and cost controls.

5 weeks

What You'll Learn

Configure vLLM and TensorRT-LLM for maximum throughput
Apply speculative decoding and KV cache tuning to cut latency
Build semantic and intent-based routers that select the right model per request
Deploy auto-scaling inference on Kubernetes with KEDA and KServe
Conduct cloud cost audits and optimize spend across model tiers

Course Content

W1
Week 1: Inference Engine Architecture
Understand what happens between the HTTP request and the first token.
1
Throughput vs. Latency Tradeoffs
Map the tradeoff between throughput and latency, and learn which serving knobs move each one for different workloads.
2
Continuous Batching Execution
See how continuous batching fills GPU compute between requests, eliminating idle time and multiplying effective throughput.
3
vLLM Architecture
Explore PagedAttention, the key innovation inside vLLM, and how it manages KV cache memory the way an operating system pages virtual memory.
4
TensorRT-LLM Compilation
Compile models to optimized TensorRT engines with INT8 and FP8 quantization, fused attention kernels, and in-flight batching.
5
SGLang and Alternative Engines
Evaluate SGLang, MLC-LLM, and ExLlamaV2 as alternatives to vLLM and understand when each excels.
Weekly Win
Benchmarked Inference Engine
Deploy vLLM and TensorRT-LLM side-by-side, run throughput and latency benchmarks, and document the performance delta.
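The continuous-batching idea from Week 1 can be sketched as a toy scheduler. This is a minimal simulation, not vLLM's actual scheduler: each request needs some number of decode steps, the GPU runs one step per tick for every request in the batch, and all request lengths here are hypothetical.

```python
def static_batch_ticks(lengths, batch_size):
    """Static batching: a batch occupies the GPU until its longest
    request finishes, so short requests wait on long ones."""
    ticks = 0
    for i in range(0, len(lengths), batch_size):
        ticks += max(lengths[i:i + batch_size])
    return ticks

def continuous_batch_ticks(lengths, batch_size):
    """Continuous batching: a finished request's slot is refilled
    immediately from the queue, so GPU slots rarely sit idle."""
    pending = list(lengths)
    running = []
    ticks = 0
    while pending or running:
        # Refill free batch slots before the next decode step.
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))
        ticks += 1
        # One decode step for every running request; done requests leave.
        running = [n - 1 for n in running if n > 1]
    return ticks

lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batch_ticks(lengths, 4))      # → 200
print(continuous_batch_ticks(lengths, 4))  # → 110
```

With a mix of short and long requests, refilling slots nearly halves total GPU time in this toy setting, which is the intuition behind continuous batching's throughput gains.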
W2
Week 2: Advanced Serving Techniques
Squeeze every millisecond and token out of your serving stack.
1
Speculative Decoding
Use a small draft model to speculatively generate tokens that the target model verifies in parallel, cutting wall-clock latency.
2
Serving Quantized Weights
Serve GPTQ, AWQ, and GGUF quantized models with minimal quality regression using engine-native quantization support.
3
Streaming Response Architectures
Implement server-sent events and WebSocket streaming so users see tokens as they are generated rather than waiting for completion.
4
Advanced KV Cache Tuning
Tune KV cache block size, eviction policies, and prefix caching to maximize cache hit rates across concurrent sessions.
5
Cold Start Mitigation
Pre-warm model replicas, use model caching layers, and implement keep-alive strategies to mitigate cold start latency spikes.
Weekly Win
Latency-Optimized Serving Setup
Demonstrate 40%+ latency reduction using speculative decoding and KV cache tuning compared to a vanilla vLLM baseline.
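The draft-and-verify loop behind speculative decoding can be sketched in a few lines. This is a simplified greedy version (real implementations verify k+1 tokens per target pass and sample probabilistically); the toy "models" below are deterministic functions standing in for real networks.

```python
def speculative_decode(target, draft, prompt, k=4, max_new=12):
    """Simplified greedy speculative decoding: the draft proposes k
    tokens autoregressively, one target pass verifies them, and the
    longest agreeing prefix is accepted (with the target's own token
    substituted at the first mismatch)."""
    seq = list(prompt)
    target_calls = 0
    while len(seq) - len(prompt) < max_new:
        # Draft phase: the cheap model proposes k tokens.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Verify phase: one (conceptually parallel) target pass.
        target_calls += 1
        accepted, ctx = [], list(seq)
        for t in proposal:
            expect = target(ctx)
            if t != expect:
                accepted.append(expect)  # target's correction, then stop
                break
            accepted.append(t)
            ctx.append(t)
        seq.extend(accepted)
    return seq[len(prompt):][:max_new], target_calls

# Toy deterministic "models": next token is a function of context length.
target = lambda ctx: (len(ctx) * 7) % 10
draft = target  # a perfect draft means every proposal is accepted

tokens, calls = speculative_decode(target, draft, prompt=[1, 2, 3])
print(len(tokens), calls)  # → 12 3
```

Twelve tokens from only three target passes: wall-clock latency tracks the number of expensive target calls, which is exactly what speculative decoding reduces when the draft agrees often.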
W3
Week 3: Smart Routing Strategies
Send every request to exactly the right model at the right cost.
1
The Economics of Inference
Model the cost-per-token across model sizes and hardware tiers to build the business case for intelligent routing.
2
Edge Embedding Generation
Generate lightweight semantic embeddings at the gateway layer without forwarding the full request to a large model.
3
Similarity Search Routing
Route requests to specialized model variants using nearest-neighbor search over a catalog of domain embedding centroids.
4
Intent and Complexity Classifiers
Train lightweight classifiers to predict query complexity and intent, routing simple queries to small models and complex ones to larger models.
5
Cascading Escalation Paths
Design multi-tier escalation where a small model handles easy requests and automatically escalates failures to larger models.
Weekly Win
Semantic Router in Production
A working router that classifies incoming requests and directs them to a small or large model based on intent and complexity scores.
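Nearest-centroid routing from Week 3 can be sketched end to end. Everything here is illustrative: the three-dimensional "embedding" is just keyword counts, and the model names and centroid values are hypothetical, but the shape (embed at the gateway, cosine-match against per-tier centroids) is the real pattern.

```python
import math

# Hypothetical per-tier centroids in a toy 3-dim embedding space
# (dims: code-ness, reasoning-ness, chitchat-ness).
CENTROIDS = {
    "small-model": [0.1, 0.1, 0.9],
    "large-model": [0.7, 0.8, 0.1],
}

KEYWORDS = (
    ["def", "class", "bug"],        # code-ness
    ["prove", "why", "plan"],       # reasoning-ness
    ["hi", "thanks", "weather"],    # chitchat-ness
)

def embed(text):
    """Toy gateway-side embedding: keyword counts per dimension."""
    words = text.lower().split()
    return [sum(words.count(k) for k in dim) for dim in KEYWORDS]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def route(text):
    """Send the request to the tier whose centroid is nearest."""
    v = embed(text)
    return max(CENTROIDS, key=lambda m: cosine(v, CENTROIDS[m]))

print(route("hi thanks for the weather update"))    # → small-model
print(route("why does this bug happen prove it"))   # → large-model
```

In production the toy embedder becomes a small sentence-embedding model and the centroids come from clustering labeled traffic, but the routing decision stays this cheap: one embedding plus a nearest-neighbor lookup.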
W4
Week 4: Kubernetes-Native AI Deployment
Run inference at cloud scale with automated provisioning and cost control.
1
Kubernetes Native AI
Structure model serving workloads as Kubernetes Deployments, Services, and ConfigMaps following cloud-native AI patterns.
2
KServe Deployment
Deploy models using KServe's InferenceService CRDs for standardized serving, versioning, and canary rollouts on Kubernetes.
3
Envoy AI Gateway Integration
Route inference traffic through an Envoy-based AI gateway for load balancing, rate limiting, and per-model observability.
4
KEDA Event-Driven Autoscaling
Configure KEDA to scale model replicas on queue depth, RPS, or GPU utilization metrics rather than CPU-based HPA.
5
GPU Time-Sharing and Slicing
Use MIG and time-sharing to partition a single GPU across multiple small model replicas, maximizing hardware utilization.
Weekly Win
Auto-Scaling Inference Cluster
A KServe deployment that scales from zero to N replicas under load and back to zero at idle, with GPU time-sharing enabled.
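The KEDA configuration from Week 4 can be sketched by building the ScaledObject manifest programmatically. The trigger shown follows KEDA's RabbitMQ scaler schema; the deployment and queue names are hypothetical, and in practice you would apply the emitted manifest with kubectl.

```python
import json

def keda_scaledobject(deployment, queue, min_replicas=0, max_replicas=8, depth=5):
    """Build a KEDA ScaledObject (as a dict) that scales a model-serving
    Deployment on queue depth rather than CPU-based HPA metrics."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-scaler"},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": min_replicas,   # 0 → scale to zero at idle
            "maxReplicaCount": max_replicas,
            "triggers": [{
                "type": "rabbitmq",
                "metadata": {
                    "queueName": queue,
                    "mode": "QueueLength",
                    "value": str(depth),       # target backlog per replica
                },
            }],
        },
    }

print(json.dumps(keda_scaledobject("vllm-llama", "inference-requests"), indent=2))
```

Queue depth is a better scaling signal than CPU for GPU inference: a saturated GPU can show modest CPU usage while requests pile up, whereas backlog per replica maps directly to user-visible queueing latency.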
W5
Week 5: Capstone — Production Inference System
Deliver a stress-tested, cost-audited inference gateway ready for launch.
1
Capstone: Gateway Architecture
Design the full gateway topology: Envoy → router → model tiers → observability stack, documented as an architecture diagram.
2
Capstone: Model Registry Integration
Connect the serving stack to a model registry for versioned deployments, rollback capability, and lineage tracking.
3
Capstone: Stress Testing Concurrency
Run Locust or k6 load tests to find throughput limits, identify bottlenecks, and validate SLA compliance under peak load.
4
Capstone: Router Refinement
Retrain the intent classifier on real traffic logs and A/B test the updated routing policy against the baseline.
5
Capstone: Cloud Cost Audit and Launch
Audit per-request cost across model tiers, optimize instance selection, and launch the system with a defined cost budget.
Weekly Win
Production-Launched Inference Gateway
A live, stress-tested inference gateway with semantic routing, auto-scaling, and a documented cost-per-request under target SLA.
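The capstone's cost audit reduces to a back-of-envelope model like the one below. All prices, throughputs, and the utilization figure are illustrative assumptions, not benchmarks; the point is the shape of the calculation.

```python
# Hypothetical tiers: hourly GPU price and sustained decode throughput.
TIERS = {
    "small": {"gpu_usd_per_hr": 1.10, "tokens_per_sec": 2400},
    "large": {"gpu_usd_per_hr": 4.50, "tokens_per_sec": 350},
}

def cost_per_request(tier, tokens_out, utilization=0.6):
    """USD per request: GPU-seconds consumed, divided by utilization
    because idle capacity is still billed, times the hourly rate."""
    t = TIERS[tier]
    gpu_seconds = tokens_out / t["tokens_per_sec"] / utilization
    return gpu_seconds * t["gpu_usd_per_hr"] / 3600

for tier in TIERS:
    print(tier, f"${cost_per_request(tier, tokens_out=500):.6f}")
```

Under these assumptions the large tier costs roughly 28x more per 500-token request, which is the gap a semantic router monetizes by keeping easy traffic on the small tier.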

Prerequisites

Linux and Docker fluency
Python API development
Basic Kubernetes knowledge

Hands-on Project

Deploy a multi-model inference gateway with semantic routing, stress-tested for concurrency, with a full cost audit and GPU time-sharing configuration.

Advanced Level
Course Price
14,999 (India) · $249 (International)
One-time payment
Next cohort starts Mar 30
Duration: 5 weeks
Level: Advanced
Format: Cohort-based
Modules: 5

What's included:

Live cohort sessions
Hands-on projects
Certificate of completion
Lifetime access
Career support

Part of Learning Track

AI Architect
7 courses in track