
Most teams running LLMs start with a cloud API. At some point — whether driven by cost, compliance, or latency — the question becomes: should we self-host? And if we do, should we run on a VM or on Kubernetes?
This post answers those questions with specifics. It covers when AKS + GPU inference makes sense, how to choose the right model for your use case, and how to size every layer of the stack: GPU node, pod configuration, and replica count.
When Does GPU Inference on AKS Make Sense?
Option A: Cloud API (Azure OpenAI). No infrastructure. Pay per token. Works immediately. But your prompts transit Microsoft’s inference infrastructure, pricing is per-token at commercial rates, and the GPU capacity is shared with every other tenant.
Option B: Self-hosted on a VM. SSH in, run docker pull vllm/vllm-openai, point it at your GPU. Simple. But the VM bills 24/7 regardless of whether you’re inferencing. A T4 VM (NC4as_T4_v3) at ~$0.53/hr costs ~$380/month running continuously.
Option C: Self-hosted on AKS with NAP + KEDA. The GPU node exists only when inference is running. When idle, NAP (Karpenter on AKS) deprovisions the node and GPU billing stops. A workload running 4 hours/day pays ~$50/month instead of ~$380.
That gap — $330/month per GPU node — is the core economic argument for this stack.
When the API wins
Don’t over-engineer. The cloud API is the right choice when:
- Volume is under ~10K requests/day — at low volume, API simplicity beats infrastructure cost
- You need GPT-4-class multimodal and no open model matches your quality bar
- You have no MLOps capacity — self-hosting requires ~0.25–0.5 FTE to maintain
- You need Microsoft’s compliance certifications (SOC 2, HIPAA BAA) without building them yourself
When AKS wins
- Data sovereignty — prompts and completions never leave your VNet. This is the deciding factor for HIPAA, PCI-DSS, the EU AI Act, and customer contracts that prohibit data leaving your environment. A call to Azure OpenAI transits Microsoft’s inference infrastructure; a self-hosted model in AKS never crosses your VNet boundary.
- Cost at volume — the numbers as of 2026:
| Model | Input / 1M tokens | Output / 1M tokens |
|---|---|---|
| GPT-4o (Azure OpenAI) | $2.50 | $10.00 |
| GPT-4o-mini (Azure OpenAI) | $0.15 | $0.60 |
| Self-hosted Mistral 7B (1x T4) | ~$0.11 | ~$0.11 |
| Self-hosted Llama 3.3 70B (2x A100) | ~$1.00 | ~$1.00 |
Self-hosted cost per 1M tokens = GPU $/hr ÷ (throughput tok/s × 3,600) × 1,000,000. Throughput is the key variable: the figures above assume ~3,000 tok/s on an NC16as_T4_v3 ($1.20/hr) and ~2,000 tok/s on an NC48ads_A100_v4 ($7.34/hr); measure your own before trusting any per-token number.
The break-even formula:
```
break_even (req/day) = fixed_daily_overhead
                       ─────────────────────────────────────────
                       api_cost_per_req − selfhost_cost_per_req

where:
  fixed_daily_overhead  = system node pool $/day = $0.37/hr × 24 = $8.88/day
                          ← vLLM Standalone only; the GPU node deprovisions at idle.
                          For KAITO, add the GPU node: $8.88 + $1.20 × 24 = $37.68/day
  api_cost_per_req      = (input_tokens × $0.15 + output_tokens × $0.60) / 1,000,000
  selfhost_cost_per_req = total_tokens × gpu_$/hr / (throughput_tok_s × 3,600)
```
Worked example — Mistral 7B vs GPT-4o-mini, 500 input + 300 output tokens avg, 3,000 tok/s:
```
api_cost_per_req      = (500 × $0.15 + 300 × $0.60) / 1,000,000 = $0.000255 / request
selfhost_cost_per_req = 800 × $1.20 / (3,000 × 3,600)           = $0.000089 / request

vLLM Standalone (GPU deprovisioned at idle):
  break_even = $8.88 / ($0.000255 − $0.000089)  ≈  53,500 requests / day

KAITO (GPU node always running while the Workspace exists):
  break_even = $37.68 / ($0.000255 − $0.000089) ≈ 227,000 requests / day
```
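The same arithmetic in a few lines of Python: a minimal sketch using the assumptions from the worked example above (GPT-4o-mini prices, $1.20/hr GPU, 3,000 tok/s), not measured values.

```python
def break_even_req_per_day(fixed_daily_overhead, input_tokens, output_tokens,
                           api_in_per_1m, api_out_per_1m,
                           gpu_per_hr, throughput_tok_s):
    """Requests/day above which self-hosting is cheaper than the API."""
    api_cost = (input_tokens * api_in_per_1m + output_tokens * api_out_per_1m) / 1_000_000
    selfhost_cost = (input_tokens + output_tokens) * gpu_per_hr / (throughput_tok_s * 3_600)
    if selfhost_cost >= api_cost:
        return float("inf")  # self-host cost per token exceeds the API's: never wins on cost
    return fixed_daily_overhead / (api_cost - selfhost_cost)

# Mistral 7B on NC16as_T4_v3 vs GPT-4o-mini, 500 input + 300 output tokens:
print(break_even_req_per_day(8.88,  500, 300, 0.15, 0.60, 1.20, 3000))   # ≈ 53,500  (vLLM Standalone)
print(break_even_req_per_day(37.68, 500, 300, 0.15, 0.60, 1.20, 3000))   # ≈ 227,000 (KAITO)
```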
Cost curves — vLLM Standalone starts at ~$9/day (system nodes only) and rises slowly; API starts at $0 and rises steeply. KAITO starts at ~$38/day regardless of traffic:

Deployment model determines your cost floor. vLLM Standalone achieves true GPU billing scale-to-zero via NAP — you pay ~$9/day at idle. KAITO sets do-not-disrupt: true on GPU nodes, preventing NAP consolidation while the Workspace CRD exists. Your cost floor with KAITO is ~$38/day regardless of traffic. Use KAITO for workloads with consistent demand; use vLLM Standalone for bursty or dev/test workloads where idle periods are significant.
What shifts the break-even:
| Factor | Direction | Why |
|---|---|---|
| Higher GPU throughput | Break-even drops | Cheaper self-hosted cost per token |
| Output-heavy requests | Break-even drops | API charges 4× more for output than input |
| More expensive API tier (GPT-4o vs mini) | Break-even drops sharply | Larger cost gap per request |
| KAITO deployment (node always-on) | Break-even rises to ~227K req/day | Fixed daily cost jumps from $8.88 to $37.68 |
| Low throughput (<~1,050 tok/s in this example) | No break-even — self-host never wins | Below it, self-host cost per token exceeds the API’s, so no volume recovers the fixed overhead |
Critical caveat: the cost advantage only materializes when the GPU is well-utilized and the GPU node deprovisions during idle (vLLM Standalone). For KAITO, the node runs continuously; the savings come from multi-model sharing and operational simplicity, not idle cost elimination.
KAITO and scale-to-zero — an important nuance: KAITO sets karpenter.sh/do-not-disrupt: "true" on every NodeClaim it creates (nodeclaim.go:151). This blocks NAP consolidation — the GPU node stays running as long as the Workspace CRD exists, even when all replicas are scaled to zero by KEDA. KAITO’s official KEDA integration (docs) scales inference pods only and uses minReplicaCount: 1 in all examples. Community request #306 tracks GPU node scale-to-zero — it has no implementation commitment as of 2026.
do-not-disrupt only blocks voluntary disruption. When the Workspace is deleted, KAITO’s GC finalizer deletes the NodeClaim directly (workspace_gc_finalizer.go), which bypasses the annotation and terminates the node. But this is a slow teardown path (6-8 min to redeploy), not the KEDA replica-scale path.
For true GPU billing scale-to-zero: use vLLM standalone (vllm-standalone.yaml) instead of KAITO. Without the do-not-disrupt annotation, NAP deprovisions the GPU node when KEDA scales replicas to zero. The KAITO model manifests in this repo remain valid for always-on or near-always-on workloads.
- Bursty or unpredictable traffic — KEDA scales from zero replicas when demand arrives, and NAP provisions GPU nodes automatically. No pre-provisioned capacity sitting idle between spikes.
- Multiple models — running Mistral for customer support and Llama for internal agents on the same cluster means one KAITO Workspace CRD per model, sharing the same system node pool. On VMs it’s manual port juggling.
- Customization — fine-tune on your domain data with KAITO’s QLoRA support (a single kubectl apply). Fine-tuning GPT-4o is restricted, more expensive, and the result stays on Microsoft’s infrastructure.
- Latency control — dedicated GPU, predictable P95 TTFT, direct control over serving parameters. Cloud API TTFT spikes during tenant peak hours.
- No vendor lock-in — model versions get deprecated on the API provider’s schedule. With open weights, you pin the version you tested. It runs forever.
Quality context: As of 2026, Llama 3.3 70B scores 86.0% on MMLU (0-shot, CoT) per the Meta model card, vs GPT-4o’s 88.7% on the same variant per OpenAI’s Hello GPT-4o. For most enterprise tasks, the gap between open-source and proprietary models has effectively closed. A fine-tuned smaller model often outperforms a larger general-purpose one on your specific domain.
VM vs AKS: when the complexity pays off
| Dimension | VM (single GPU) | AKS + KAITO + NAP + KEDA |
|---|---|---|
| Setup time | Minutes | ~10 min first time |
| GPU billing | 24/7, always on | Only while inferencing (scale-to-zero) |
| Multi-model | Manual port juggling | One KAITO Workspace CRD per model |
| Scaling to N replicas | Manual | KEDA + NAP handles it |
| Secrets / auth | .env files, SSH keys | Workload Identity — nothing stored |
| Cost at idle | Full GPU VM cost | ~$0 — node deprovisioned |
| RBAC | OS-level | Kubernetes RBAC + Azure RBAC |
Use a VM when: prototyping a single model, running long fine-tuning jobs that can’t tolerate interruption, or you want zero operational complexity.
Use AKS when: multiple models, bursty traffic, compliance requirements, or you already run other workloads on AKS and want to reuse the cluster.
How to Pick the Right Model
Run through these constraints in order. The first one that applies wins.
Step 1: What is your available VRAM?
VRAM is the hard constraint. Before evaluating quality, check what GPU tier you can access:
- 1x T4 (16 GB) → Phi-4 Mini, Phi-3 Mini, Mistral 7B (tight), Mistral 7B AWQ
- 1x A10 (24 GB) → Mistral 7B fp16 (comfortable), Phi-4 14B
- 1x A100 (80 GB) → Llama 3.1 8B, Llama 3.3 70B (quantized AWQ)
- 2x A100 (160 GB) → Llama 3.3 70B (full fp16 precision)
Step 2: What is your primary task?
| Task | Recommended model | Reason |
|---|---|---|
| Customer support / chat | Mistral 7B | Fast, cheap, strong instruction following |
| Code generation | Llama 3.3 70B | Best open-source code quality |
| Math / STEM / reasoning | Phi-4 Mini | Beats GPT-4o on MATH benchmark (80.4% vs 74.6%) |
| Long documents / RAG | Phi-3 Mini 128K or Llama 3.3 70B | 128K context window |
| Multi-turn agents / tool use | Llama 3.3 70B | Best open-source tool-use as of 2026 |
| Edge / batch classification | Phi-3 Mini or Llama 3.2 3B | Small, fast, cheap |
Step 3: License requirements
| License | Models | Restrictions |
|---|---|---|
| MIT | Phi family | None — zero ambiguity, no attribution required |
| Apache 2.0 | Mistral 7B / Mixtral | No meaningful restrictions |
| Llama Community License | Llama 3.x | OK for <700M MAU; cannot be used to build competing foundation models |
Step 4: Do you need fine-tuning?
If yes: Mistral 7B or Llama 3.1 8B — most tooling, KAITO QLoRA support, most community resources.
Model comparison
All models listed are available as open weights for self-hosting. Organized by minimum GPU required.
T4 tier — NC4as/NC8as/NC16as_T4_v3 ($0.53–$1.20/hr)
| Model | Params | MMLU | License | Azure SKU | Notes |
|---|---|---|---|---|---|
| Phi-4 Mini | 3.8B | 67.3% ⁵ | MIT | NC4as_T4_v3 | Math/reasoning at T4 budget |
| Phi-3 Mini 128K | 3.8B | 69.7% ⁵ | MIT | NC4as_T4_v3 | 128K context on T4; RAG |
| Gemma 3 4B | 4B | ~60% ¹ | Gemma ToU | NC4as_T4_v3 | Native text+image multimodal |
| Mistral 7B AWQ | 7B | 60.1% ² | Apache 2.0 | NC4as_T4_v3 | High-volume chat; fp16 is too tight for T4 |
| Qwen2.5-7B AWQ | 7B | 74.2% ⁵ | Apache 2.0 | NC8as_T4_v3 | Highest MMLU in T4 tier; 29 languages; KAITO-supported |
| DeepSeek R1 Distill 7B AWQ | 7B | ~57% ¹ | MIT | NC8as_T4_v3 | Chain-of-thought reasoning on T4; beats GPT-4o on MATH-500 |
Single A100 80GB — NC24ads_A100_v4 ($3.67/hr)
| Model | Params | MMLU | License | Notes |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | 73.0% ³ | Llama Community | General purpose; strong fine-tuning ecosystem |
| Gemma 3 12B | 12B | ~74% ¹ | Gemma ToU | Multimodal; strong multilingual |
| Qwen2.5-14B | 14B | 79.7% ⁵ | Apache 2.0 | Best mid-tier MMLU; 128K context |
| DeepSeek R1 Distill 32B AWQ | 32B | ~78% ¹ | MIT | Reasoning beats o1-mini; ~24GB at AWQ int4 |
| Gemma 3 27B | 27B | 78.6% ⁵ | Gemma ToU | Chatbot Arena Elo 1338 — outranks models 10× its size |
| Mistral Small 3.1/3.2 | 24B | ~80.6% | Apache 2.0 | 3× throughput vs Llama 70B; 128K context; optional vision |
Dual A100 — NC48ads_A100_v4 ($7.34/hr)
| Model | Params | MMLU | License | Notes |
|---|---|---|---|---|
| Llama 3.3 70B | 70B | 86.0% ³ | Llama Community | Best open tool-use and agents |
| Qwen2.5-72B | 72B | 86.1% ⁵ | Tongyi Qianwen ⁴ | Stronger math/code than Llama 70B; multilingual |
H100 cluster — ND96isr_H100_v5 (8× H100 SXM 80GB)
| Model | Params (total / active) | MMLU | License | Notes |
|---|---|---|---|---|
| Llama 4 Scout | 109B / 17B MoE | ~79.6% | Llama Community | 10M context; multimodal; single H100 at int4 |
| DeepSeek R1 | 671B / 37B MoE | 90.8% | MIT | Best open reasoning model; 8× H100 at FP8 |
| Kimi K2 / K2.5 | 1T / 32B MoE | 89.5% | Mod. MIT | Top open coding/agents; ~500GB int4; 4–8× H100 |
¹ Approximate — not officially published as standalone MMLU for these variants.
² Mistral 7B v0.3 has no separately published MMLU score; v0.3 changed only the tokenizer. Figure is from the original v0.1 paper.
³ Instruct model, 0-shot CoT. Base model 5-shot: Llama 3.1 8B = 66.7%; Llama 3.3 70B = 79.3%.
⁴ Tongyi Qianwen license: commercial use permitted with attribution; no meaningful restrictions for standard enterprise deployment.
⁵ 5-shot evaluation unless noted otherwise.


Recommended starting sequence
- Qwen2.5-7B or Phi-4 Mini — validate your pipeline on T4 ($0.53–0.75/hr)
- Mistral Small 3.1 or Qwen2.5-14B — validate quality at the mid-tier (single A100)
- Llama 3.3 70B or Qwen2.5-72B — set your quality ceiling
- DeepSeek R1 — if reasoning/math is critical, benchmark this before deciding you need a proprietary model
- Azure OpenAI GPT-4o or Claude — if none of the above meets your bar, now you have a concrete comparison point
How to Size the Stack
Getting this order right is critical. Pod config depends on the node. Replica count depends on pod config. Start at node selection.
Step 0: Measure your workload first
Every sizing decision downstream depends on two numbers: p95 prompt tokens and p95 completion tokens. Measure them separately — they matter differently. Prompt tokens drive KV cache prefill and max-model-len; completion tokens drive throughput and generation time.
If you’re already calling Azure OpenAI or the OpenAI API:
Every response includes token counts in the usage field. Log them and compute p95:
```python
import numpy as np

# Collect usage objects from your API response logs
# e.g. {"prompt_tokens": 512, "completion_tokens": 287}
samples = [...]  # your logged usage objects

prompt = [s["prompt_tokens"] for s in samples]
completion = [s["completion_tokens"] for s in samples]

print(f"p95 prompt:     {np.percentile(prompt, 95):.0f} tokens")
print(f"p95 completion: {np.percentile(completion, 95):.0f} tokens")
print(f"p95 total:      {np.percentile([p + c for p, c in zip(prompt, completion)], 95):.0f} tokens")
```
Token counts are also in Azure Monitor → Metrics → Azure OpenAI → Token Transaction, exportable as CSV.
If you’re already running vLLM:
Query the built-in Prometheus histograms:
```promql
# p95 prompt tokens (PromQL for Azure Managed Prometheus)
histogram_quantile(0.95, rate(vllm:request_prompt_tokens_bucket[1h]))

# p95 completion tokens
histogram_quantile(0.95, rate(vllm:request_generation_tokens_bucket[1h]))
```
Or directly from the metrics endpoint:
```bash
kubectl port-forward svc/<vllm-service> 8000:8000 -n inference &
curl -s http://localhost:8000/metrics | grep -E 'request_prompt_tokens|request_generation_tokens'
```
If you’re starting from scratch:
Count tokens on 200–500 representative real prompts using the target model’s tokenizer:
```python
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

# Assemble prompts exactly as your app would send them
# (system prompt + user message + any few-shot examples)
prompts = ["You are a helpful assistant.\n\nUser: <real example>", ...]

counts = [len(tokenizer.encode(p)) for p in prompts]
print(f"p50: {np.percentile(counts, 50):.0f}  p95: {np.percentile(counts, 95):.0f}  max: {max(counts)}")
```
For completion tokens, run 100 real requests against a pilot deployment and measure output length. Completion length varies by model and instruction phrasing — you cannot reliably estimate it without running the model.
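A sketch of that pilot measurement against the OpenAI-compatible endpoint; the base URL assumes the port-forward shown earlier, and the model name must match whatever your deployment actually serves:

```python
import numpy as np
from openai import OpenAI  # pip install openai — vLLM speaks the same protocol

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

real_prompts = [...]  # 100 representative production prompts
completion_tokens = []
for p in real_prompts:
    resp = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.3",  # must match the served model
        messages=[{"role": "user", "content": p}],
    )
    completion_tokens.append(resp.usage.completion_tokens)

print(f"p95 completion: {np.percentile(completion_tokens, 95):.0f} tokens")
```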
Starting estimates if you have no data yet:
| Use case | p95 prompt | p95 completion |
|---|---|---|
| Customer support chat | 400–800 | 150–300 |
| RAG (retrieval + question) | 1,500–4,000 | 200–500 |
| Code generation | 500–2,000 | 500–2,000 |
| Document summarization | 4,000–32,000 | 300–800 |
| Multi-turn agents with tool calls | 2,000–8,000 | 500–2,000 |
Validate these estimates before locking in GPU tier and max-model-len. A RAG workload sized on customer-support assumptions will OOM under real traffic.
The governing equation
Total VRAM required = Weights + KV Cache + Runtime Overhead (10–20%)
Step 1: Select the GPU node (VRAM calculation)
1a. Weights memory
Model weights load entirely into VRAM before the first token is generated.
```
Weights (GB) = Parameter count (billions) × bytes per parameter

Precision     Bytes/param   Notes
────────────────────────────────────────────────────────────────
fp16          2.0           Default. Works on T4, V100, A100
bfloat16      2.0           Preferred on A100/H100 (better range)
int8          1.0           ~0–2% quality loss
int4 (AWQ)    0.5           ~1–3% quality loss, needs pre-quantized checkpoint
```
| Model | Params | fp16 / bf16 | int8 | int4 (AWQ) |
|---|---|---|---|---|
| Phi-4 Mini | 3.8B | 7.6 GB | 3.8 GB | 1.9 GB |
| Gemma 3 4B | 4B | 8.0 GB | 4.0 GB | 2.0 GB |
| Mistral 7B | 7.3B | 14.6 GB | 7.3 GB | 3.7 GB |
| Qwen2.5-7B | 7B | 14.0 GB | 7.0 GB | 3.5 GB |
| Llama 3.1 8B | 8B | 16.0 GB | 8.0 GB | 4.0 GB |
| Gemma 3 12B | 12B | 24.0 GB | 12.0 GB | 6.0 GB |
| Qwen2.5-14B | 14B | 28.0 GB | 14.0 GB | 7.0 GB |
| Mistral Small 3.1 | 24B | 48.0 GB | 24.0 GB | 12.0 GB |
| Gemma 3 27B | 27B | 54.0 GB | 27.0 GB | 13.5 GB |
| DeepSeek R1 Distill 32B | 32B | 64.0 GB | 32.0 GB | 16.0 GB |
| Llama 3.3 70B | 70.6B | 141 GB | 70.6 GB | 35.3 GB |
| Qwen2.5-72B | 72B | 144 GB | 72.0 GB | 36.0 GB |
| DeepSeek R1 / Kimi K2 | 671B / 1T | impractical | impractical | ~335 GB / ~500 GB |
Rule: If weights occupy more than 70% of available VRAM, go up one GPU tier. The remaining 30% is for KV cache + overhead. At 90%+ on weights alone, you will OOM under any real concurrency.
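The 70% rule is easy to script. A minimal sketch using the bytes-per-parameter table above:

```python
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(params_b: float, precision: str) -> float:
    return params_b * BYTES_PER_PARAM[precision]

def passes_70_percent_rule(params_b: float, precision: str, vram_gb: float) -> bool:
    """Weights must leave >= 30% of VRAM for KV cache + runtime overhead."""
    return weights_gb(params_b, precision) <= 0.70 * vram_gb

print(passes_70_percent_rule(7.3, "fp16", 16))  # Mistral 7B fp16 on T4 -> False (14.6 > 11.2 GB)
print(passes_70_percent_rule(7.3, "int4", 16))  # Mistral 7B AWQ on T4  -> True  (3.7 GB)
```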
1b. KV Cache memory
KV cache is the part most people underestimate. It grows with the number of concurrent requests, the sequence length, and the model’s attention structure. It can exceed weights memory under load.
Simplified rule of thumb — KV cache per concurrent request per 1K context tokens:
| Model | KV per request per 1K context tokens (fp16) |
|---|---|
| Phi-4 Mini (GQA) | ~0.12 GB |
| Mistral 7B (GQA) | ~0.12 GB |
| Llama 3.1 8B (GQA) | ~0.12 GB |
| Llama 3.3 70B (GQA) | ~0.31 GB |
Example — Mistral 7B, 32 concurrent requests, 4K context:
32 requests × 4 (1K blocks) × 0.12 GB ≈ 16 GB ← as much as a T4’s entire VRAM, before weights
Example — Mistral 7B, 32 concurrent requests, 32K context:
32 × 32 × 0.12 GB ≈ 125 GB ← dominates even an A100 80 GB
Long context (32K+) with high concurrency is where KV cache dominates. Use --kv-cache-dtype fp8 to halve it on A100/H100.
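The rule-of-thumb table comes from the standard KV-cache formula: 2 (one K and one V tensor) × layers × KV heads × head dimension × bytes per value, per token. A sketch you can point at any model’s config.json; the Mistral 7B numbers below are its published architecture values, so verify them against your checkpoint:

```python
def kv_gb_per_1k_tokens(n_layers: int, n_kv_heads: int, head_dim: int,
                        bytes_per_value: int = 2) -> float:
    """KV cache per request per 1,000 context tokens (fp16 by default)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # K + V
    return per_token * 1_000 / 1024**3

kv = kv_gb_per_1k_tokens(n_layers=32, n_kv_heads=8, head_dim=128)  # Mistral 7B v0.3 (GQA)
print(f"{kv:.2f} GB / request / 1K tokens")        # ≈ 0.12 GB
print(f"{32 * 4 * kv:.0f} GB at 32 req × 4K ctx")  # ≈ 16 GB, matching the 4K example above
```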
1c. GPU selection table
Selection rule: Usable VRAM > Weights × 1.3
| Model + dtype | Weights | Min usable VRAM | GPU tier | Azure SKU | $/hr |
|---|---|---|---|---|---|
| Phi-4 Mini (fp16) | 7.6 GB | 9.9 GB | T4 16 GB | NC4as_T4_v3 | $0.53 |
| Mistral 7B / Qwen2.5-7B (AWQ) | 3.5–3.7 GB | 4.6–4.8 GB | T4 16 GB | NC4as_T4_v3 | $0.53 |
| Mistral 7B (fp16) | 14.6 GB | 19.0 GB | T4 too tight — use AWQ | NC16as_T4_v3 | $1.20 |
| Qwen2.5-14B (fp16) | 28.0 GB | 36.4 GB | A100 80 GB | NC24ads_A100_v4 | $3.67 |
| Llama 3.1 8B (fp16) | 16.0 GB | 20.8 GB | A100 80 GB | NC24ads_A100_v4 | $3.67 |
| Mistral Small 3.1 (fp16) | 48.0 GB | 62.4 GB | A100 80 GB | NC24ads_A100_v4 | $3.67 |
| Gemma 3 27B (fp16) | 54.0 GB | 70.2 GB | A100 80 GB | NC24ads_A100_v4 | $3.67 |
| DeepSeek R1 Distill 32B (AWQ) | 16.0 GB | 20.8 GB | A100 80 GB | NC24ads_A100_v4 | $3.67 |
| Llama 3.3 70B (fp16) | 141 GB | 183 GB | 2× A100 80 GB | NC48ads_A100_v4 | $7.34 |
| Qwen2.5-72B (fp16) | 144 GB | 187 GB | 2× A100 80 GB | NC48ads_A100_v4 | $7.34 |
| Llama 4 Scout (int4) | ~55 GB | ~72 GB | H100 80 GB | ND96isr_H100_v5 | ~$98 |
| DeepSeek R1 (FP8) | ~335 GB | ~436 GB | 8× H100 80 GB | ND96isr_H100_v5 | ~$98 |
| Kimi K2/K2.5 (int4) | ~500 GB | ~650 GB | 8× H100 80 GB | ND96isr_H100_v5 | ~$98 |
Mistral 7B on T4 warning: weights fill 14.6 GB of 16 GB — only ~1.4 GB left for KV cache. You must set max-model-len: 2048 and max-num-seqs: 16 or you will OOM, and even then only a few full-length sequences fit in cache at once. Mistral 7B int4 AWQ is strongly recommended on T4.
Step 2: Configure the vLLM pod (four parameters)
These four parameters interact — changing one affects the others. Set them in this order.
Parameter 1: gpu-memory-utilization
```yaml
gpu-memory-utilization: 0.90   # Good default
```
This controls how much total VRAM vLLM claims at startup for weights + KV cache combined. It does not mean 90% goes to weights — the model loads first, and whatever is left within this fraction becomes KV cache.
- Increase to 0.95 → more KV cache → higher concurrency or longer context
- Decrease to 0.85 → if you get random OOMKilled under moderate load
- Never use 1.0 → CUDA needs a reservation for kernels and activations
Parameter 2: max-model-len — your real throughput knob
This sets the ceiling on total tokens per request (input + output). It directly controls how much KV cache each request can consume.
The most common mistake: setting max-model-len: 131072 when your app sends 500-token prompts and expects 300-token responses. This wastes enormous KV cache reservation.
Recommended formula:
```
max-model-len = 2 × p95(actual prompt tokens + completion tokens)

Example:
  p95 prompt:     500 tokens
  p95 completion: 300 tokens
  → max-model-len: 2048   (not 128K just because the model supports it)
```
Effect of halving max-model-len — you roughly double concurrent capacity:
```
Mistral 7B fp16 on T4, remaining VRAM for KV ≈ 1.4 GB (~0.12 GB per 1K tokens per sequence):
  max-model-len: 4096 → ~3 full-length concurrent sequences
  max-model-len: 2048 → ~6
  max-model-len: 1024 → ~12
```
Parameter 3: max-num-seqs — concurrency ceiling
```yaml
max-num-seqs: 64   # Maximum concurrent sequences in the vLLM scheduler
```
When this ceiling is hit, new requests queue and wait. Starting formula:
```
max-num-seqs = floor(available_kv_vram_GB / kv_per_seq_at_max_model_len_GB)

Example — Phi-4 Mini on T4:
  Weights:           7.6 GB
  Available for KV:  16 × 0.90 − 7.6 = 6.8 GB
  KV per seq @ 4K:   4 × 0.12 GB ≈ 0.48 GB
  → floor(6.8 / 0.48) ≈ 14 full-length sequences guaranteed
  → Start at 32–64: PagedAttention allocates per actual token, so
    shorter typical requests raise effective concurrency
```
In practice: start at 32–64, run a load test, watch vllm:gpu_cache_usage_perc in Prometheus. Increase until it hits 85–90% under expected peak load.
Parameter 4: dtype
dtype: "float16" # T4, V100 — no native bfloat16dtype: "bfloat16" # A100, H100 — better numerical stability, same memory
Reference configs by GPU tier
```yaml
# T4 16 GB — Phi-4 Mini (comfortable)
vllm:
  gpu-memory-utilization: 0.90
  max-model-len: 8192
  max-num-seqs: 64
  dtype: "float16"
  enable-prefix-caching: true

# T4 16 GB — Mistral 7B fp16 (tight — reduce if OOM)
vllm:
  gpu-memory-utilization: 0.92
  max-model-len: 2048   # Conservative on T4 for Mistral fp16
  max-num-seqs: 16
  dtype: "float16"

# T4 16 GB — Mistral 7B int4 AWQ (recommended on T4)
vllm:
  gpu-memory-utilization: 0.90
  max-model-len: 8192
  max-num-seqs: 64
  dtype: "float16"
  quantization: "awq"

# V100 16 GB — Llama 3.1 8B
vllm:
  gpu-memory-utilization: 0.95
  max-model-len: 4096
  max-num-seqs: 32
  dtype: "float16"
  enable-prefix-caching: true

# A100 80 GB — Llama 3.1 8B (lots of headroom)
vllm:
  gpu-memory-utilization: 0.90
  max-model-len: 32768
  max-num-seqs: 256
  dtype: "bfloat16"
  enable-prefix-caching: true
  enable-chunked-prefill: true

# 2x A100 80 GB — Llama 3.3 70B
vllm:
  gpu-memory-utilization: 0.90
  max-model-len: 8192
  max-num-seqs: 64
  dtype: "bfloat16"
  tensor-parallel-size: 2
  enable-prefix-caching: true
```
Additional flags worth knowing:
```yaml
enable-prefix-caching: true
# Reuses KV cache for identical prompt prefixes across requests.
# Major win for chatbots (same system prompt) and RAG (same retrieval context).
# No cost — enable by default.

enable-chunked-prefill: true
# Processes long prompts in chunks instead of one shot.
# Prevents a long prefill from starving concurrent short requests.
# Use when you mix short and long prompts.

kv-cache-dtype: "fp8"
# Halves KV cache memory vs fp16.
# Allows 2x more concurrent sequences for the same VRAM.
# Requires A100/H100. ~0.5% quality degradation.

swap-space: 4
# CPU RAM (GB) vLLM can use to offload KV cache blocks when VRAM is full.
# Acts as a spillover buffer. Set to 4–16 GB on nodes with large system RAM.
```
Step 3: Pod resource requests
```yaml
resources:
  requests:
    nvidia.com/gpu: "1"   # Always 1 per vLLM pod.
                          # vLLM owns the GPU exclusively — never share.
    cpu: "4"              # Tokenizer + HTTP server: 2–4 cores sufficient.
                          # GPU is the bottleneck, not CPU.
    memory: "24Gi"        # Weights load through CPU RAM first, then copy to VRAM.
                          # Set to ~1.5× weights size.
  limits:
    nvidia.com/gpu: "1"   # Prevents accidental multi-GPU scheduling.
    cpu: "8"              # Allow burst for prefill spikes.
    memory: "32Gi"        # Headroom for Python runtime + buffers.
```
Why CPU and RAM are not your bottleneck: vLLM’s hot path (attention, sampling) runs entirely on GPU. The CPU handles HTTP parsing, tokenization, and scheduling — lightweight tasks that rarely exceed 2–3 cores even at high throughput. Over-provisioning CPU doesn’t help. Over-provisioning GPU does.
Step 4: Replica count and KEDA alignment
A single vLLM pod owns one GPU and processes one batch at a time. Horizontal scaling (more pods, more GPU nodes via NAP) is the primary way to increase total throughput.
```
required_replicas = ceil(peak_concurrent_users / max-num-seqs)

Example — Mistral 7B on T4, max-num-seqs = 32:
  Peak concurrent users: 200
  → ceil(200 / 32) = 7 replicas = 7 GPU nodes (NC16as_T4_v3)
  → Cost at peak: 7 × $1.20/hr = $8.40/hr
  → Cost at idle: $0 — NAP deprovisions all 7 nodes
```
For bursty workloads, size for p95 concurrency, not the all-time peak. Let KEDA handle spikes up to maxReplicaCount.
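The replica arithmetic as a sketch, using the numbers from the example above:

```python
import math

def size_replicas(peak_concurrent_users: int, max_num_seqs: int,
                  gpu_per_hr: float) -> tuple[int, float]:
    """Replicas needed at peak and the corresponding peak GPU cost."""
    replicas = math.ceil(peak_concurrent_users / max_num_seqs)
    return replicas, replicas * gpu_per_hr

replicas, peak_cost = size_replicas(200, 32, 1.20)
print(f"{replicas} replicas -> ${peak_cost:.2f}/hr at peak, ~$0 at idle")  # 7 -> $8.40/hr
```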
KEDA ScaledObject alignment — your max-num-seqs should inform your KEDA threshold:
```yaml
triggers:
  - type: prometheus
    metadata:
      query: avg(vllm:num_requests_waiting{namespace="inference"})
      threshold: "5"             # Add a replica when avg waiting > 5
      activationThreshold: "1"
minReplicaCount: 1               # Keep 1 warm to avoid cold-start on first request
maxReplicaCount: 7               # ceil(200 peak users / 32 max-num-seqs)
```
Scaling decision guide:
| Signal | Action |
|---|---|
| vllm:num_requests_waiting > 0 consistently | Add replicas |
| vllm:gpu_cache_usage_perc > 90% | Reduce max-num-seqs OR add replicas |
| GPU utilization < 40% at peak | Pod/GPU oversized — go down a tier |
| OOMKilled | Reduce max-num-seqs or max-model-len |
| TTFT > SLO at low concurrency | GPU too slow — go up one tier |
Quick reference: all parameters together
| Model | GPU | max-model-len | max-num-seqs | dtype | tensor-parallel |
|---|---|---|---|---|---|
| Phi-4 Mini | T4 16GB | 8192 | 64 | float16 | 1 |
| Phi-3 Mini 128K | T4 16GB | 8192 | 32 | float16 | 1 |
| Mistral 7B (fp16) | T4 16GB | 2048 | 16 | float16 | 1 |
| Mistral 7B (AWQ) | T4 16GB | 8192 | 64 | float16 | 1 |
| Llama 3.1 8B | V100 16GB | 4096 | 32 | float16 | 1 |
| Llama 3.3 70B | 2x A100 80GB | 8192 | 64 | bfloat16 | 2 |
Cost Awareness
| Component | When billed | Approx. cost |
|---|---|---|
| System node pool (D4ds_v5 ×2) | Always | ~$0.37/hr total |
| NC4as_T4_v3 (Phi-4 Mini) | Only when NAP provisions | ~$0.53/hr |
| NC16as_T4_v3 (Mistral 7B) | Only when NAP provisions | ~$1.20/hr |
| NC6s_v3 (Llama 3 8B) | Only when NAP provisions | ~$0.90/hr |
| NC24ads_A100_v4 (Llama 3 70B) | Only when NAP provisions | ~$3.67/hr per node |
NAP deprovisions GPU nodes after 2 minutes of idle. A dev/test workflow running occasional requests pays for GPU time only while actively inferencing — often under $5/day.
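A quick estimator for that table, assuming the GPU node bills only while provisioned and the system pool bills 24/7:

```python
def daily_cost(gpu_per_hr: float, active_gpu_hours: float,
               system_pool_per_hr: float = 0.37) -> float:
    """Approximate $/day: system nodes always on, GPU only while NAP keeps the node."""
    return system_pool_per_hr * 24 + gpu_per_hr * active_gpu_hours

print(f"${daily_cost(0.53, 4):.2f}/day")    # Phi-4 Mini on T4, 4 active hrs -> $11.00
print(f"${daily_cost(1.20, 24):.2f}/day")   # Mistral 7B always-on (KAITO)   -> $37.68
```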
Architecture Overview

Component Deep Dives
KEDA — Kubernetes Event-Driven Autoscaling
What problem it solves: Standard HPA scales on CPU/memory — meaningless for LLM inference where GPUs are the bottleneck and requests arrive unpredictably. KEDA watches external event sources and scales Deployments based on real demand signals.
The scale-to-zero trick: HPA requires at least one running pod to collect metrics. KEDA bypasses this by monitoring event sources directly from its operator — no running pod needed. When demand arrives, it sets replicas from 0 → 1 before HPA ever gets involved.
Three trigger modes in this lab:
| Trigger | File | Best For |
|---|---|---|
| HTTP Add-on | keda/1-http-scaledobject.yaml | Synchronous inference API; buffers requests while pods cold-start |
| Service Bus Queue | keda/2-servicebus-scaledobject.yaml | Async batch inference; message durability; decoupled producers |
| Azure Managed Prometheus | keda/3-prometheus-scaledobject.yaml | React to GPU utilization or vLLM internal queue depth |
HTTP Add-on internals: The HTTP add-on installs an interceptor proxy (2 replicas) that sits in front of your Service. All traffic routes through it. When a request arrives and the target deployment has 0 replicas, the proxy holds the connection open, signals KEDA to scale up, and forwards the request once the pod is ready. This is transparent to the client — they just see extra latency on the first request.
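On the client side this means one thing: set a timeout that covers the worst-case cold start. A sketch with the OpenAI SDK; the endpoint and model name are placeholders for your deployment:

```python
from openai import OpenAI

# Worst case behind the interceptor: NAP node provision (3-5 min) + image pull
# + model load. A default 60s client timeout gives up long before that.
client = OpenAI(
    base_url="https://inference.example.com/v1",  # placeholder endpoint
    api_key="unused",
    timeout=600,
)

resp = client.chat.completions.create(
    model="phi-4-mini-instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```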
HTTP Add-on production caveats:
- Always-on cost: the 2 interceptor replicas run continuously regardless of inference traffic (~$0.35/hr on Standard_D2ds_v5). This is not included in scale-to-zero savings calculations and is the minimum cost floor for HTTP-triggered scaling.
- Cold-start timeout: the proxy has a finite wait timeout. If NAP provisioning + container pull + model load exceeds it, the client receives a 503 even if the pod eventually becomes ready. Set scaledownPeriod high enough that the GPU node stays warm between requests during active usage periods.
- Long-generation workloads: for models that generate responses taking >60s (e.g. Llama 3.3 70B on complex prompts), use the Service Bus trigger instead — it provides durable buffering with no proxy timeout constraint.
Key tuning parameters:
```yaml
cooldownPeriod: 120       # Seconds of idle before scaling to zero.
                          # For LLMs: set higher than your longest generation.
                          # Killing a pod mid-generation loses the response.
pollingInterval: 15       # How often KEDA queries the trigger source.
                          # Lower = faster reaction, more API calls to Azure.
activationThreshold: 1    # Queue depth that triggers scale from 0 → 1.
                          # Keep at 1 for interactive use cases.
threshold: 5              # Target metric value per replica.
                          # "Add a replica per 5 queued messages" or
                          # "add a replica when GPU util > 70%"
```
Authentication (no secrets stored): KEDA’s TriggerAuthentication uses azure-workload provider. The KEDA operator ServiceAccount is federated with an Azure managed identity (via Terraform). It exchanges its OIDC token for a scoped AAD token at query time. Connection strings never touch etcd.
NAP — Node Auto Provisioning (Karpenter on AKS)
Preview status: NAP is in public preview as of early 2026. Microsoft’s support boundary for preview features differs from GA — it is not covered by an SLA and breaking changes may occur. Verify current status at aka.ms/aks-nap before using in production. GPU SKU availability varies significantly by region — NC4as_T4_v3 is broadly available but H100 SKUs are quota-limited in most regions. Request quota at aka.ms/AzureGPUQuota before designing for specific GPU families.
Spot GPU instances: Azure spot pricing reduces GPU costs by ~75–80% (NC4as_T4_v3: ~$0.10/hr vs $0.53/hr on-demand). The lab includes a spot NodePool (gpu-inference-spot) for async/batch workloads. Do not use spot for synchronous HTTP inference or KAITO workloads. KAITO’s do-not-disrupt annotation blocks Karpenter’s voluntary consolidation but does not block Azure spot eviction (an involuntary interruption) — the GPU node will still be evicted mid-inference. Spot is safe for Service Bus queue workers where jobs requeue on failure. See manifests/nap/gpu-nodepool.yaml for the two-NodePool setup (on-demand primary, spot secondary).
What problem it solves: Classic AKS cluster autoscaler requires pre-created node pools with fixed VM sizes. If you don’t have a GPU node pool, GPU-requesting pods stay Pending forever. NAP replaces this with a Karpenter-based controller that analyzes each pending pod’s resource requirements and dynamically provisions the optimal VM.
How selection works:
```
Pending pod requests:
  nvidia.com/gpu: 1
  memory: 16Gi
  cpu: 4

NAP evaluates NodePool requirements:
  sku-family: NC
  sku-name: [NC4as_T4_v3, NC8as_T4_v3, NC16as_T4_v3, NC6s_v3, NC24ads_A100_v4]
  capacity-type: on-demand

NAP picks the cheapest SKU that fits all requests:
  → Standard_NC4as_T4_v3 (1x T4, 28GiB RAM, 4 vCPU) wins
  → VM provisions, joins cluster, pod schedules
```
GPU node lifecycle:
```
Pod pending                  → NAP provisions node (3–5 min)
Pod running                  → model loads → inference starts
Pod completed / scaled to 0  → node idle
consolidateAfter: 2m         → NAP deprovisions node → GPU billing stops
```
Important: KAITO sets karpenter.sh/do-not-disrupt: "true" on every NodeClaim it creates (source). This blocks Karpenter’s voluntary disruption (consolidation) — the GPU node stays alive as long as the Workspace CRD exists, even when replicas are scaled to zero by KEDA. do-not-disrupt only blocks voluntary disruption; when the Workspace is deleted, KAITO’s GC finalizer deletes the NodeClaim directly (workspace_gc_finalizer.go), which bypasses the annotation and terminates the node. KAITO’s official KEDA integration (v0.8.0+) scales pods only and uses minReplicaCount: 1 in all examples — GPU node scale-to-zero is not supported (issue #306). See KAITO vs vLLM Standalone below.
Key CRDs in this lab:
- NodePool (Karpenter API) — constraints: GPU SKU families, capacity type (spot vs on-demand), architecture, taints
- AKSNodeClass (Azure extension) — VNet/subnet ID, OS disk size, image family
GPU taint/toleration pattern:
```yaml
# NodePool applies this taint to every GPU node it provisions:
taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule

# Pods must declare this toleration to land on GPU nodes:
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
```
This ensures CPU workloads never accidentally schedule onto expensive GPU VMs.
Cost guard:
```yaml
limits:
  nvidia.com/gpu: "8"   # Hard cap: NAP won't provision beyond 8 GPUs total.
                        # Without this, a misconfigured workload can exhaust
                        # your entire Azure GPU quota.
```
KAITO — Kubernetes AI Toolchain Operator
What problem it solves: Deploying an LLM on Kubernetes without KAITO requires: knowing the right GPU SKU, writing vLLM startup args, configuring tensor parallelism, managing GPU driver plugin DaemonSets, writing readiness probes tuned to 2-minute model load times, and more. KAITO wraps all of this into a single 15-line Workspace CRD.
What KAITO does when you kubectl apply a Workspace:
1. Reads the Workspace spec (instanceType, preset name)
2. Validates the GPU SKU has enough VRAM for the model
3. Creates a Deployment with correct GPU requests + tolerations
4. Creates a ConfigMap with vLLM startup arguments
5. Creates a ClusterIP Service named after the workspace
6. Monitors the Deployment → updates Workspace status conditions
NAP provisions the GPU node in parallel (step 3 triggers it).
NC6s_v3 (V100) is an older GPU generation being progressively retired from Azure regions. Verify availability in your target region before depending on it. If unavailable, NC8as_T4_v3 is the recommended alternative for Llama 3.1 8B with quantization (AWQ/GPTQ reduces VRAM requirement to ~10 GB).
Preset model matrix in this lab:
| KAITO Preset | File | Min GPU | Min VRAM | Approx GPU VM |
|---|---|---|---|---|
| phi-4-mini-instruct | workspace-phi4-mini.yaml | 1x T4 | 8 GB | NC4as_T4_v3 |
| phi-3-mini-128k-instruct | workspace-phi3-mini.yaml | 1x T4 | 10 GB | NC8as_T4_v3 |
| mistral-7b-instruct | workspace-mistral-7b.yaml | 1x T4 | 14 GB | NC16as_T4_v3 |
| llama-3.1-8b-instruct | workspace-llama3-8b.yaml | 1x V100 | 16 GB | NC6s_v3 |
| llama-3.3-70b-instruct | workspace-llama3-70b.yaml | 2x A100 | 160 GB | 2x NC24ads_A100_v4 |
vLLM ConfigMap tuning: KAITO passes inference config via a ConfigMap referenced in the Workspace. Key vLLM parameters for LLM workloads:
```yaml
vllm:
  gpu-memory-utilization: 0.90   # Fraction of total VRAM vLLM claims for
                                 # weights + KV cache. Leave 10% margin.
  max-model-len: 4096            # Maximum sequence length (input + output).
                                 # Reduce to fit in VRAM if OOM.
  max-num-seqs: 64               # Max concurrent sequences in the scheduler.
                                 # Each sequence consumes KV cache memory.
  dtype: "float16"               # T4/V100: use float16. A100/H100: use bfloat16.
  enable-prefix-caching: true    # Cache KV for repeated system prompts.
                                 # Big win for chatbot workloads (same system prompt).
```
KAITO vs vLLM Standalone:
| | KAITO | vLLM Standalone |
|---|---|---|
| Model packaging | Pre-built MCR images — no HuggingFace token needed | Pull from HuggingFace or your own registry |
| GPU validation | Validates VRAM before scheduling | Fails at runtime (OOM) |
| Multi-node (70B+) | Handled automatically (Ray topology) | Manual Ray configuration |
| vLLM version control | Pinned to KAITO release | Any version |
| True GPU scale-to-zero | ✗ — do-not-disrupt pins the node | ✓ — NAP deprovisions freely |
| Cold start (node warm) | Fast — image cached on node | Fast — image cached on node |
Use KAITO for always-on or near-always-on workloads (minReplicaCount: 1), multi-node large models, or when you want preset GPU validation with minimal YAML.
Use vLLM Standalone (manifests/vllm/vllm-standalone.yaml) when true GPU billing scale-to-zero is required — bursty workloads, dev/lab environments, or any scenario where the GPU should deprovision during idle periods. Also use standalone for custom LoRA adapters, quantized (GGUF/AWQ) weights, or a vLLM version newer than what KAITO packages.
Checking workspace status:
```bash
kubectl get workspace -n inference
# NAME                  INSTANCE              RESOURCEREADY  INFERENCEREADY  WORKSPACEREADY
# workspace-phi4-mini   Standard_NC4as_T4_v3  True           True            True

kubectl describe workspace workspace-phi4-mini -n inference
# Look at the Conditions section for detailed status
```
vLLM — OpenAI-Compatible Inference Server
Why vLLM (not TGI, Ollama, etc.):
- PagedAttention: manages KV cache as virtual memory pages → higher throughput
- Continuous batching: processes multiple requests in parallel without waiting
- OpenAI API compatibility: drop-in replacement for GPT-4 clients (no SDK change)
- Tensor parallelism: split a model across multiple GPUs in one line (--tensor-parallel-size 2)
- Prefix caching: reuse KV cache for repeated system prompts (significant for chatbots)
OpenAI-compatible endpoints:
```
POST /v1/chat/completions — ChatGPT-style multi-turn conversation
POST /v1/completions      — Legacy text completion
GET  /v1/models           — Lists available models
GET  /health              — Readiness probe endpoint
GET  /metrics             — Prometheus metrics (queue depth, TTFT, throughput)
```
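Because the surface is OpenAI-compatible, measuring TTFT needs nothing beyond the standard SDK in streaming mode. A sketch assuming the port-forwarded service from earlier and a served Mistral model:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
ttft = None
stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
    stream=True,
)
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start  # time to first generated token
print(f"TTFT: {ttft:.2f}s")
```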
Cold-start latency breakdown:
| Phase | Duration | Notes |
|---|---|---|
| NAP VM provision | 3-5 min | Only if no GPU node available |
| Container pull | 1-2 min | vLLM image ~8GB; faster after first pull |
| Model download | 2-10 min | From HuggingFace; cached in PVC after first run |
| Model load to VRAM | 30-120s | Proportional to model size |
| vLLM ready | ~10s | After model loaded |
Use a PVC for model caching (see vllm-standalone.yaml). Without it, every pod restart re-downloads the full model. With it, cold start goes from 10+ minutes to under 2 minutes after the first run.
Workload Identity
Why not connection strings or Kubernetes Secrets: Secrets stored in etcd are base64-encoded (not encrypted) by default. Rotation requires a redeployment. If your etcd backup leaks, all secrets leak. Workload Identity eliminates the problem entirely.
The OIDC federation chain:
```
Kubernetes Pod
  ↓ ServiceAccount projected token (JWT, short-lived, in /var/run/secrets/)
Azure Workload Identity Webhook
  ↓ injects AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_FEDERATED_TOKEN_FILE
Azure AD
  ↓ validates OIDC token against the cluster's OIDC issuer URL
  ↓ checks subject matches the federated credential (system:serviceaccount:ns:sa)
  ↓ issues an AAD access token scoped to the requested resource
Azure Resource (Key Vault / Service Bus / Foundry)
  ↓ validates AAD token → grants access
```
Three identities in this lab:
| Identity | Used By | Permissions |
|---|---|---|
| kaito-identity | KAITO GPU provisioner | Contributor on AKS cluster (to provision nodes) |
| keda-identity | KEDA operator | Monitoring Data Reader (Prometheus), Service Bus Data Owner |
| workload-identity | Inference pods | Key Vault Secrets User, Service Bus Data Sender/Receiver |
Secrets Store CSI Driver: Mounts Key Vault secrets as files inside pods at /mnt/secrets/. Combined with the secretObjects block in SecretProviderClass, secrets are also mirrored as Kubernetes Secret objects (for workloads that read from env vars).
```bash
# Store your HuggingFace token in Key Vault (required for Llama 3):
az keyvault secret set \
  --vault-name <KEY_VAULT_NAME> \
  --name hf-token \
  --value "hf_xxxxxxxxxxxxxxxxxxxx"
```
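Pods that prefer the SDK over a CSI file mount can fetch the same secret directly through the federation chain. A sketch assuming the pod runs under the federated ServiceAccount (azure.workload.identity/use: "true"); the vault name is a placeholder:

```python
# pip install azure-identity azure-keyvault-secrets
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# The workload identity webhook injects AZURE_CLIENT_ID, AZURE_TENANT_ID and
# AZURE_FEDERATED_TOKEN_FILE; DefaultAzureCredential exchanges the projected
# ServiceAccount token for an AAD token. No connection string anywhere.
credential = DefaultAzureCredential()
client = SecretClient(vault_url="https://<KEY_VAULT_NAME>.vault.azure.net",
                      credential=credential)

hf_token = client.get_secret("hf-token").value  # never touches etcd or env vars
```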
Directory Structure
```
aks-ai-lab/
├── terraform/
│   ├── main.tf                  # AKS + NAP + KAITO + KEDA + Key Vault + Service Bus
│   ├── variables.tf
│   ├── outputs.tf
│   └── terraform.tfvars.example
├── manifests/
│   ├── nap/
│   │   └── gpu-nodepool.yaml    # Karpenter NodePool + AKSNodeClass for GPU nodes
│   ├── kaito/
│   │   ├── namespace.yaml
│   │   ├── workspace-phi4-mini.yaml    # Cheapest: T4 16GB
│   │   ├── workspace-phi3-mini.yaml    # 128K context: T4 16GB
│   │   ├── workspace-mistral-7b.yaml   # Balanced: T4 16GB
│   │   ├── workspace-llama3-8b.yaml    # Quality: V100 16GB
│   │   └── workspace-llama3-70b.yaml   # Premium: 2x A100 80GB
│   ├── keda/
│   │   ├── 1-http-scaledobject.yaml        # Scale on HTTP request concurrency
│   │   ├── 2-servicebus-scaledobject.yaml  # Scale on Service Bus queue depth
│   │   └── 3-prometheus-scaledobject.yaml  # Scale on GPU util / vLLM queue
│   ├── workload-identity/
│   │   ├── serviceaccount.yaml         # Federated SA for inference pods
│   │   ├── secret-provider-class.yaml  # Key Vault → pod file mounts
│   │   └── keda-trigger-auth.yaml      # KEDA → Azure auth (no secrets)
│   ├── vllm/
│   │   └── vllm-standalone.yaml        # Direct vLLM deployment (non-KAITO)
│   ├── ingress/
│   │   ├── 1-app-routing.yaml              # AKS App Routing add-on (NGINX) — lab/dev
│   │   ├── 2-app-gateway-containers.yaml   # Application Gateway for Containers — production
│   │   ├── 3-istio-gateway.yaml            # Istio ingress + VirtualService — production
│   │   └── 4-inference-extension.yaml      # Gateway API Inference Extension — multi-replica
│   └── monitoring/
│       └── dcgm-exporter.yaml   # NVIDIA GPU metrics DaemonSet
├── tests/
│   ├── TESTING.md                    # Test guide — what each test validates
│   ├── 00-run-all-tests.sh           # Run full test suite
│   ├── 01-test-endpoint.sh           # vLLM API surface validation
│   ├── 02-test-keda-scaling.sh       # Scale-up / scale-down lifecycle
│   ├── 03-test-nap-lifecycle.sh      # GPU node provision / deprovision
│   ├── 04-test-load.sh               # Throughput / concurrency benchmark
│   └── 05-test-workload-identity.sh  # OIDC → AAD → Key Vault chain
├── docs/
│   ├── sizing-guide.md    # Node / pod / replica sizing formulas
│   └── ingress-guide.md   # Ingress options, manifests, decision guide
└── scripts/
    ├── 00-prereqs.sh       # Tool versions, GPU quota, feature flag check
    ├── 01-deploy.sh        # terraform apply + helm installs + namespace setup
    ├── 02-deploy-model.sh  # kubectl apply a KAITO workspace + watch status
    └── 03-smoke-test.sh    # port-forward + OpenAI API test + KEDA status
```
Quickstart
Prerequisites
- Azure subscription with NC-series GPU quota (request at https://aka.ms/AzureGPUQuota)
- Tools: az, kubectl, helm, terraform, jq
1. Clone and configure
```bash
git clone <your-repo> aks-ai-lab
cd aks-ai-lab
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit terraform.tfvars — set subscription_id and location
```
2. Check prerequisites
```bash
chmod +x scripts/*.sh
./scripts/00-prereqs.sh
```
3. Deploy infrastructure
```bash
./scripts/01-deploy.sh
# Takes ~10 minutes. Creates AKS cluster with NAP, KAITO, KEDA add-ons,
# Key Vault, Service Bus, managed identities, federated credentials.
```
4. Store secrets in Key Vault
```bash
# Required for Llama 3 (gated model). Optional for Phi/Mistral.
az keyvault secret set --vault-name <KV_NAME> --name hf-token --value "hf_xxx"
az keyvault secret set --vault-name <KV_NAME> --name foundry-api-key --value "xxx"
```
5. Deploy a model
```bash
# Start with Phi-4 Mini — fastest and cheapest (T4 GPU, ~$0.50/hr)
./scripts/02-deploy-model.sh phi4-mini

# Or deploy directly:
kubectl apply -f manifests/kaito/workspace-phi4-mini.yaml

# Watch NAP provision the GPU node:
kubectl get nodes -w
# NAME                  STATUS   ROLES   AGE   VERSION
# aks-system-xxxxx      Ready    agent   10m   v1.31
# (after 3-5 min):
# aks-nc4ast4v3-xxxxx   Ready    agent   1m    v1.31   ← GPU node!
```
6. Apply KEDA scaling
```bash
# Update placeholders in the KEDA manifests first:
export SB_NS=$(terraform -chdir=terraform output -raw servicebus_namespace)
sed -i "s|<SERVICEBUS_NAMESPACE>|$SB_NS|g" manifests/keda/2-servicebus-scaledobject.yaml

export AMW=$(az monitor account list -g rg-aks-ai-lab --query '[0].metrics.prometheusQueryEndpoint' -o tsv)
sed -i "s|<AMW_ENDPOINT>|$AMW|g" manifests/keda/3-prometheus-scaledobject.yaml

kubectl apply -f manifests/keda/
```
7. Run smoke test
```bash
./scripts/03-smoke-test.sh phi4-mini
```
Useful Commands
```bash
# Watch workspace status
kubectl get workspace -n inference -w

# Check which GPU node NAP provisioned
kubectl get nodes -l karpenter.azure.com/sku-family=NC

# Watch KEDA scaling decisions
kubectl get scaledobject -n inference
kubectl describe scaledobject inference-sb-scaler -n inference

# Check GPU utilization inside a pod
kubectl exec -n inference <pod-name> -- nvidia-smi

# View vLLM metrics (port-forward first)
kubectl port-forward svc/workspace-phi4-mini 5000:5000 -n inference &
curl http://localhost:5000/metrics | grep vllm

# Force scale-to-zero (test cold-start)
kubectl scale deployment workspace-phi4-mini -n inference --replicas=0

# Send a Service Bus message (triggers KEDA scale-up)
az servicebus queue message send \
  --resource-group rg-aks-ai-lab \
  --namespace-name <SB_NAMESPACE> \
  --name inference-requests \
  --body '{"model":"phi-4-mini-instruct","messages":[{"role":"user","content":"Hello"}]}'

# Tear down everything (NAP deprovisions GPU nodes automatically)
cd terraform && terraform destroy
```
Troubleshooting
Workspace stuck in Pending
```bash
kubectl describe workspace workspace-phi4-mini -n inference
kubectl get events -n inference --sort-by=.lastTimestamp

# Common causes:
# 1. GPU quota exhausted → request quota increase
# 2. NAP NodePool limits reached → increase limits in gpu-nodepool.yaml
# 3. Feature flags not registered → re-run 00-prereqs.sh
```
Pod OOMKilled
```bash
kubectl describe pod <pod-name> -n inference

# Reduce max-model-len or max-num-seqs in the KAITO ConfigMap.
# Or upgrade to a larger GPU SKU in the Workspace instanceType.
```
KEDA not scaling
```bash
kubectl describe scaledobject inference-sb-scaler -n inference
# Check: "READY" = True, "ACTIVE" = True/False

# Common causes:
# 1. TriggerAuthentication misconfigured (wrong client ID)
# 2. KEDA identity missing role on Service Bus / Prometheus
# 3. Service Bus queue name mismatch
```
NAP not provisioning GPU nodes
```bash
kubectl get nodepool gpu-inference -o yaml
# Check: limits not exceeded, SKU family allowed in requirements

kubectl logs -n kube-system -l app=karpenter --tail=50
```
Ingress & Traffic Architecture
Ingress for LLM inference is not just a routing problem. It sits at the intersection of network security, API governance, cost control, and GPU utilization. A Kubernetes Ingress object alone addresses none of those.
Network Topology

Azure Front Door
Front Door sits on Microsoft’s anycast network with 200+ points of presence globally. It does three things you can’t skip for LLM:
- WAF — LLM endpoints are expensive to abuse. A single request generating 100K tokens costs real money. Without a WAF, anyone who discovers your endpoint runs up your GPU bill. Front Door’s WAF blocks OWASP attacks, bot traffic, and rate-limits at the edge before requests ever reach your infrastructure.
- DDoS protection — volumetric attacks are absorbed at the edge, not at your origin.
- Long-connection handling — LLM responses take 30–120 seconds for long generations. Front Door manages the client TCP connection and streaming response buffering, so your backend doesn’t have to worry about client timeouts on slow 4G connections.
Azure API Management
Standard request-count rate limiting is useless for LLM. 1,000 requests of 5 tokens each costs nothing. One request of 500K tokens can cost dollars of GPU time. APIM is the only layer that enforces limits on actual token consumption:
```xml
<llm-token-limit
    counter-key="@(context.Subscription.Id)"
    tokens-per-minute="10000"
    token-quota="5000000"
    token-quota-period="Monthly" />
```
Beyond rate limiting:
- Per-consumer quotas — different teams get different token budgets. Finance gets 10M tokens/month, a dev team gets 500K. Without this, one team can exhaust your GPU capacity and affect everyone.
- AAD authentication — verifies the caller’s identity before the request reaches the cluster. No anonymous calls to your GPU.
- Cost chargeback — logs tokens consumed per subscription/consumer to Application Insights. This is how you bill back to internal teams or external customers.
- Response caching — identical prompts never hit the GPU. Huge win for RAG workloads where 50 users ask the same question against the same document.
- Azure OpenAI fallback — if vLLM returns 503 (cold-starting, NAP provisioning a GPU node), APIM automatically falls back to Azure OpenAI API. The client sees no interruption, just a slightly slower response.
App Gateway for Containers
A standard Kubernetes LoadBalancer service operates at Layer 4 (TCP). It has no understanding of HTTP — no routing based on headers, no health checks based on HTTP response codes, no connection draining.
App Gateway for Containers (AGfC) is managed Envoy running outside your cluster. Azure operates the data plane — no CPU or memory overhead on your nodes. What it adds:
- Connection draining — when vLLM is scaling down (KEDA setting replicas to 0), AGfC stops sending new requests to that pod and waits for in-flight requests to finish. Without this, scaling down kills active generations mid-response.
- HTTP/2 and gRPC — vLLM supports both. A Layer 4 LB passes them through blindly; AGfC routes them intelligently.
- Health-based routing — routes only to pods that return 200 on
/health, not just pods that have a running TCP listener. A vLLM pod that’s still loading a 70B model will pass TCP health checks but not HTTP health checks.
Istio
Without Istio, any pod in the cluster that can reach the vLLM Service can call it directly. You have no encryption, no access control, and no observability below the HTTP layer.
- mTLS — all pod-to-pod traffic is encrypted and mutually authenticated using short-lived certificates. Only pods with the right ServiceAccount identity can call vLLM. This is the only way to enforce zero-trust inside the cluster.
- Circuit breakers — LLM pods can get stuck: model loading takes 2–5 minutes, and during that time the pod accepts connections but never responds. Istio’s circuit breaker detects this (response time exceeds threshold, error rate spikes) and stops routing to that pod, giving it time to recover instead of queuing 500 requests against a broken pod.
- Distributed tracing — each request gets a trace ID propagated end-to-end. When a user reports “my request took 90 seconds”, you can see: 2s in AGfC → 3s in Istio routing → 85s in vLLM generation. Without tracing, you’re guessing where the latency is.
- Retry policies — if a request hits a pod still initializing (returns 503), Istio retries automatically against a healthy pod. The client never sees the 503.
Gateway API Inference Extension
Standard Kubernetes load balancing is round-robin — it has no knowledge of what each vLLM pod is actually doing. This is a major performance problem for LLM specifically because of KV cache.
vLLM stores the KV cache (computed attention values for input tokens) in GPU memory. If your system prompt is 2,000 tokens and all 50 users of your chatbot share the same system prompt, any pod that has already processed that system prompt has the KV values cached. If the next request for that user hits a different pod, the pod has to recompute 2,000 tokens from scratch — wasted GPU compute.
The Inference Extension solves this with two mechanisms:
- Prefix-hash routing — hashes the prefix of the prompt (system prompt + conversation history) and routes to the pod most likely to have that prefix in its KV cache. For chatbot workloads where every user shares the same system prompt, cache hit rates go from near-zero to 80%+. This translates directly to 2–4× throughput on the same hardware.
- Load-aware routing — reads vLLM’s Prometheus metrics (queue depth, KV cache utilization, running requests) and routes new requests to the pod with the most available capacity. Standard round-robin ignores this — a pod with 50 queued requests gets the same weight as a pod with 0.
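A toy sketch of the prefix-hash idea; it is not the extension's actual algorithm (which also weighs live load, as described above), but it shows why shared system prompts make sticky routing pay off:

```python
import hashlib

def pick_pod(shared_prefix: str, pods: list[str]) -> str:
    """Requests whose prompts share a prefix land on the same pod,
    so that pod's KV cache for the prefix is reused, not recomputed."""
    digest = hashlib.sha256(shared_prefix.encode()).digest()
    return pods[int.from_bytes(digest[:8], "big") % len(pods)]

pods = ["vllm-0", "vllm-1", "vllm-2"]
system_prompt = "You are a support assistant for Contoso. Policies: ..."

# Hash only the shared part (system prompt + prior turns), not the new message:
assert pick_pod(system_prompt, pods) == pick_pod(system_prompt, pods)
print(pick_pod(system_prompt, pods))  # every conversation with this prefix -> same pod
```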
What This Lab Omits: Firewall
The lab uses a single-spoke VNet design. In production you typically add a hub VNet with a Firewall sitting between the public edge and your workloads:
Internet → AFD (WAF) → Firewall (hub) → APIM (spoke) → AKS (spoke)
The firewall gives you centralized egress logging (every pod outbound connection), threat intelligence filtering, and spoke-to-spoke isolation when multiple teams share the same landing zone. Without it, a compromised pod can reach any internet destination.
For production workloads with compliance requirements or multi-team AKS clusters, it becomes non-negotiable.
GitHub Repository
The lab repository includes Terraform for all infrastructure, Kubernetes manifests for every component, five test scripts validating the full lifecycle from API surface to KEDA scale-up/down to NAP node lifecycle, a sizing guide with the full VRAM formulas, and an ingress guide covering the production network topology:
https://github.com/rtrentinavx/aks-ai-lab
References
Azure Platform
- AKS Node Auto Provisioning (NAP) overview — Microsoft Learn
- Use NVIDIA GPUs on AKS — Microsoft Learn
- Use AMD GPUs on AKS — Microsoft Learn
- AKS Workload Identity overview — Microsoft Learn
- Azure API Management LLM token limit policy — Microsoft Learn
- Azure Front Door Premium WAF integration — Microsoft Learn
- Request GPU quota increase — Microsoft Azure
KAITO
- KAITO project repository — GitHub
- KAITO KEDA autoscaler integration — GitHub
- KAITO do-not-disrupt annotation source — GitHub
- KAITO workspace GC finalizer — GitHub
- KAITO issue #306 — GPU node scale-to-zero request — GitHub
Karpenter / NAP
- Karpenter disruption concepts (do-not-disrupt) — karpenter.sh
- AKS Karpenter node auto-provisioning NodePool configuration — Microsoft Learn
vLLM
- vLLM project repository — GitHub
- vLLM PagedAttention paper — Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al., 2023
KEDA
- KEDA HTTP add-on repository — GitHub
- KEDA scalers documentation — keda.sh
Gateway API / Envoy Gateway
- Kubernetes Gateway API specification — sigs.k8s.io
- Envoy Gateway documentation — gateway.envoyproxy.io
- Gateway API Inference Extension — GitHub
Model Benchmarks and Papers
- Phi-4 Mini technical report — Microsoft, 2025
- Phi-3 Mini model card — Microsoft / Hugging Face
- Mistral 7B paper — Jiang et al., 2023
- Qwen2.5 technical report — Alibaba Cloud, 2024
- Llama 3.1 8B model card — Meta / Hugging Face
- Llama 3.3 70B model card — Meta / Hugging Face
- Gemma 3 technical report — Google DeepMind, 2025
- DeepSeek R1 paper — Incentivizing Reasoning Capability in LLMs via RL — DeepSeek AI, 2025
- Kimi K2 model card — Moonshot AI / Hugging Face
- GPT-4o announcement — Hello GPT-4o — OpenAI, 2024
- Llama 3.3 70B MMLU score — Meta model card — Meta / Hugging Face
Security
- OWASP Top 10 for Large Language Model Applications — OWASP
- Azure Well-Architected Framework — Security pillar — Microsoft Learn
- NVv4 series retirement — Microsoft Learn
- NCv3 series (V100) retirement — Microsoft Learn