Advanced Course · Part of AI Architect

AI Architect: High-Performance Inference & Smart Routing

Maximize throughput and minimize latency across your model serving stack. Master vLLM, TensorRT-LLM, speculative decoding, and intelligent request routing — then deploy the full system on Kubernetes with auto-scaling and cost controls.

5 weeks

What You'll Learn

Configure vLLM and TensorRT-LLM for maximum throughput
Apply speculative decoding and KV cache tuning to cut latency
Build semantic and intent-based routers that select the right model per request
Deploy auto-scaling inference on Kubernetes with KEDA and KServe
Conduct cloud cost audits and optimize spend across model tiers

Course Content

W1
Week 1: Inference Engine Architecture
Understand what happens between the HTTP request and the first token.
1
Throughput vs. Latency Tradeoffs
Map the tradeoff between throughput and latency, and learn which serving knobs move each one for different workloads.
2
Continuous Batching Execution
See how continuous batching fills GPU compute between requests, eliminating idle time and multiplying effective throughput.
3
vLLM Architecture
Explore PagedAttention, the key innovation inside vLLM, and how it manages KV cache memory the way an operating system pages virtual memory.
4
TensorRT-LLM Compilation
Compile models to optimized TensorRT engines with INT8 and FP8 quantization, fused attention kernels, and in-flight batching.
5
SGLang and Alternative Engines
Evaluate SGLang, MLC-LLM, and ExLlamaV2 as alternatives to vLLM and understand when each excels.
Weekly Win
Benchmarked Inference Engine
Deploy vLLM and TensorRT-LLM side-by-side, run throughput and latency benchmarks, and document the performance delta.
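The continuous-batching idea from Week 1 can be sketched as a toy scheduler. This is a minimal simulation, not vLLM's actual scheduler: each request needs some number of decode steps, the GPU runs one step per tick for every request in the batch, and all request lengths here are hypothetical.

```python
def static_batch_ticks(lengths, batch_size):
    """Static batching: a batch occupies the GPU until its longest
    request finishes, so short requests wait on long ones."""
    ticks = 0
    for i in range(0, len(lengths), batch_size):
        ticks += max(lengths[i:i + batch_size])
    return ticks

def continuous_batch_ticks(lengths, batch_size):
    """Continuous batching: a finished request's slot is refilled
    immediately from the queue, so GPU slots rarely sit idle."""
    pending = list(lengths)
    running = []
    ticks = 0
    while pending or running:
        # Refill free batch slots before the next decode step.
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))
        ticks += 1
        # One decode step for every running request; done requests leave.
        running = [n - 1 for n in running if n > 1]
    return ticks

lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batch_ticks(lengths, 4))      # → 200
print(continuous_batch_ticks(lengths, 4))  # → 110
```

With a mix of short and long requests, refilling slots nearly halves total GPU time in this toy setting, which is the intuition behind continuous batching's throughput gains.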
W2
Week 2: Advanced Serving Techniques
Squeeze every millisecond and token out of your serving stack.
1
Speculative Decoding
Use a small draft model to speculatively generate tokens that the target model verifies in parallel, cutting wall-clock latency.
2
Serving Quantized Weights
Serve GPTQ, AWQ, and GGUF quantized models with minimal quality regression using engine-native quantization support.
3
Streaming Response Architectures
Implement server-sent events and WebSocket streaming so users see tokens as they are generated rather than waiting for completion.
4
Advanced KV Cache Tuning
Tune KV cache block size, eviction policies, and prefix caching to maximize cache hit rates across concurrent sessions.
5
Cold Start Mitigation
Pre-warm model replicas, use model caching layers, and implement keep-alive strategies to mitigate cold start latency spikes.
Weekly Win
Latency-Optimized Serving Setup
Demonstrate 40%+ latency reduction using speculative decoding and KV cache tuning compared to a vanilla vLLM baseline.
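The draft-and-verify loop behind speculative decoding can be sketched in a few lines. This is a simplified greedy version (real implementations verify k+1 tokens per target pass and sample probabilistically); the toy "models" below are deterministic functions standing in for real networks.

```python
def speculative_decode(target, draft, prompt, k=4, max_new=12):
    """Simplified greedy speculative decoding: the draft proposes k
    tokens autoregressively, one target pass verifies them, and the
    longest agreeing prefix is accepted (with the target's own token
    substituted at the first mismatch)."""
    seq = list(prompt)
    target_calls = 0
    while len(seq) - len(prompt) < max_new:
        # Draft phase: the cheap model proposes k tokens.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Verify phase: one (conceptually parallel) target pass.
        target_calls += 1
        accepted, ctx = [], list(seq)
        for t in proposal:
            expect = target(ctx)
            if t != expect:
                accepted.append(expect)  # target's correction, then stop
                break
            accepted.append(t)
            ctx.append(t)
        seq.extend(accepted)
    return seq[len(prompt):][:max_new], target_calls

# Toy deterministic "models": next token is a function of context length.
target = lambda ctx: (len(ctx) * 7) % 10
draft = target  # a perfect draft means every proposal is accepted

tokens, calls = speculative_decode(target, draft, prompt=[1, 2, 3])
print(len(tokens), calls)  # → 12 3
```

Twelve tokens from only three target passes: wall-clock latency tracks the number of expensive target calls, which is exactly what speculative decoding reduces when the draft agrees often.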
W3
Week 3: Smart Routing Strategies
Send every request to exactly the right model at the right cost.
1
The Economics of Inference
Model the cost-per-token across model sizes and hardware tiers to build the business case for intelligent routing.
2
Edge Embedding Generation
Generate lightweight semantic embeddings at the gateway layer without forwarding the full request to a large model.
3
Similarity Search Routing
Route requests to specialized model variants using nearest-neighbor search over a catalog of domain embedding centroids.
4
Intent and Complexity Classifiers
Train lightweight classifiers to predict query complexity and intent, routing simple queries to small models and complex ones to larger models.
5
Cascading Escalation Paths
Design multi-tier escalation where a small model handles easy requests and automatically escalates failures to larger models.
Weekly Win
Semantic Router in Production
A working router that classifies incoming requests and directs them to a small or large model based on intent and complexity scores.
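Nearest-centroid routing from Week 3 can be sketched end to end. Everything here is illustrative: the three-dimensional "embedding" is just keyword counts, and the model names and centroid values are hypothetical, but the shape (embed at the gateway, cosine-match against per-tier centroids) is the real pattern.

```python
import math

# Hypothetical per-tier centroids in a toy 3-dim embedding space
# (dims: code-ness, reasoning-ness, chitchat-ness).
CENTROIDS = {
    "small-model": [0.1, 0.1, 0.9],
    "large-model": [0.7, 0.8, 0.1],
}

KEYWORDS = (
    ["def", "class", "bug"],        # code-ness
    ["prove", "why", "plan"],       # reasoning-ness
    ["hi", "thanks", "weather"],    # chitchat-ness
)

def embed(text):
    """Toy gateway-side embedding: keyword counts per dimension."""
    words = text.lower().split()
    return [sum(words.count(k) for k in dim) for dim in KEYWORDS]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def route(text):
    """Send the request to the tier whose centroid is nearest."""
    v = embed(text)
    return max(CENTROIDS, key=lambda m: cosine(v, CENTROIDS[m]))

print(route("hi thanks for the weather update"))    # → small-model
print(route("why does this bug happen prove it"))   # → large-model
```

In production the toy embedder becomes a small sentence-embedding model and the centroids come from clustering labeled traffic, but the routing decision stays this cheap: one embedding plus a nearest-neighbor lookup.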
W4
Week 4: Kubernetes-Native AI Deployment
Run inference at cloud scale with automated provisioning and cost control.
1
Kubernetes Native AI
Structure model serving workloads as Kubernetes Deployments, Services, and ConfigMaps following cloud-native AI patterns.
2
KServe Deployment
Deploy models using KServe's InferenceService CRDs for standardized serving, versioning, and canary rollouts on Kubernetes.
3
Envoy AI Gateway Integration
Route inference traffic through an Envoy-based AI gateway for load balancing, rate limiting, and per-model observability.
4
KEDA Event-Driven Autoscaling
Configure KEDA to scale model replicas on queue depth, RPS, or GPU utilization metrics rather than CPU-based HPA.
5
GPU Time-Sharing and Slicing
Use MIG and time-sharing to partition a single GPU across multiple small model replicas, maximizing hardware utilization.
Weekly Win
Auto-Scaling Inference Cluster
A KServe deployment that scales from zero to N replicas under load and back to zero at idle, with GPU time-sharing enabled.
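The KEDA configuration from Week 4 can be sketched by building the ScaledObject manifest programmatically. The trigger shown follows KEDA's RabbitMQ scaler schema; the deployment and queue names are hypothetical, and in practice you would apply the emitted manifest with kubectl.

```python
import json

def keda_scaledobject(deployment, queue, min_replicas=0, max_replicas=8, depth=5):
    """Build a KEDA ScaledObject (as a dict) that scales a model-serving
    Deployment on queue depth rather than CPU-based HPA metrics."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-scaler"},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": min_replicas,   # 0 → scale to zero at idle
            "maxReplicaCount": max_replicas,
            "triggers": [{
                "type": "rabbitmq",
                "metadata": {
                    "queueName": queue,
                    "mode": "QueueLength",
                    "value": str(depth),       # target backlog per replica
                },
            }],
        },
    }

print(json.dumps(keda_scaledobject("vllm-llama", "inference-requests"), indent=2))
```

Queue depth is a better scaling signal than CPU for GPU inference: a saturated GPU can show modest CPU usage while requests pile up, whereas backlog per replica maps directly to user-visible queueing latency.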
W5
Week 5: Capstone — Production Inference System
Deliver a stress-tested, cost-audited inference gateway ready for launch.
1
Capstone: Gateway Architecture
Design the full gateway topology: Envoy → router → model tiers → observability stack, documented as an architecture diagram.
2
Capstone: Model Registry Integration
Connect the serving stack to a model registry for versioned deployments, rollback capability, and lineage tracking.
3
Capstone: Stress Testing Concurrency
Run Locust or k6 load tests to find throughput limits, identify bottlenecks, and validate SLA compliance under peak load.
4
Capstone: Router Refinement
Retrain the intent classifier on real traffic logs and A/B test the updated routing policy against the baseline.
5
Capstone: Cloud Cost Audit and Launch
Audit per-request cost across model tiers, optimize instance selection, and launch the system with a defined cost budget.
Weekly Win
Production-Launched Inference Gateway
A live, stress-tested inference gateway with semantic routing, auto-scaling, and a documented cost-per-request under target SLA.
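The capstone's cost audit reduces to a back-of-envelope model like the one below. All prices, throughputs, and the utilization figure are illustrative assumptions, not benchmarks; the point is the shape of the calculation.

```python
# Hypothetical tiers: hourly GPU price and sustained decode throughput.
TIERS = {
    "small": {"gpu_usd_per_hr": 1.10, "tokens_per_sec": 2400},
    "large": {"gpu_usd_per_hr": 4.50, "tokens_per_sec": 350},
}

def cost_per_request(tier, tokens_out, utilization=0.6):
    """USD per request: GPU-seconds consumed, divided by utilization
    because idle capacity is still billed, times the hourly rate."""
    t = TIERS[tier]
    gpu_seconds = tokens_out / t["tokens_per_sec"] / utilization
    return gpu_seconds * t["gpu_usd_per_hr"] / 3600

for tier in TIERS:
    print(tier, f"${cost_per_request(tier, tokens_out=500):.6f}")
```

Under these assumptions the large tier costs roughly 28x more per 500-token request, which is the gap a semantic router monetizes by keeping easy traffic on the small tier.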

Prerequisites

Linux and Docker fluency
Python API development
Basic Kubernetes knowledge

Hands-on Project

Deploy a multi-model inference gateway with semantic routing, stress-tested for concurrency, with a full cost audit and GPU time-sharing configuration.

Advanced Level
Course Price
14,999 (India) · $249 (International)
One-time payment
Next cohort starts Mar 30
Duration: 5 weeks
Level: Advanced
Format: Cohort-based
Modules: 5

What's included:

Live cohort sessions
Hands-on projects
Certificate of completion
Lifetime access
Career support

Part of Learning Track

AI Architect
7 courses in track