
Most teams running LLMs start with a cloud API. At some point — whether driven by cost, compliance, or latency — the question becomes: should we self-host? And if we do, should we run on a VM or on Kubernetes?
This post answers those questions with specifics. It covers when AKS + GPU inference makes sense, how to choose the right model for your use case, and how to size every layer of the stack: GPU node, pod configuration, and replica count.
When Does GPU Inference on AKS Make Sense?
Option A: Cloud API (Azure OpenAI). No infrastructure. Pay per token. Works immediately. But your prompts transit Microsoft’s inference infrastructure, pricing is per-token at commercial rates, and the GPU capacity is shared with every other tenant.
Option B: Self-hosted on a VM. SSH in, run docker pull vllm/vllm-openai, point it at your GPU. Simple. But the VM bills 24/7 regardless of whether you’re inferencing. A T4 VM (NC4as_T4_v3) at ~$0.53/hr costs ~$380/month running continuously.
Option C: Self-hosted on AKS with NAP + KEDA. The GPU node exists only when inference is running. When idle, NAP (Karpenter on AKS) deprovisions the node and GPU billing stops. A workload running 4 hours/day pays ~$50/month instead of ~$380.
That gap — $330/month per GPU node — is the core economic argument for this stack.
When the API wins
Don’t over-engineer. The cloud API is the right choice when:
- Volume is under ~10K requests/day — at low volume, API simplicity beats infrastructure cost
- You need GPT-4-class multimodal and no open model matches your quality bar
- You have no MLOps capacity — self-hosting requires ~0.25–0.5 FTE to maintain
- You need Microsoft’s compliance certifications (SOC 2, HIPAA BAA) without building them yourself
When AKS wins
- Data sovereignty — prompts and completions never leave your VNet. This is the deciding factor for HIPAA, PCI-DSS, the EU AI Act, and customer contracts that prohibit data leaving your environment. A call to Azure OpenAI transits Microsoft’s inference infrastructure; a self-hosted model in AKS never crosses your VNet boundary.
- Cost at volume — the numbers as of 2026:
| Model | Input / 1M tokens | Output / 1M tokens |
|---|---|---|
| GPT-4o (Azure OpenAI) | $2.50 | $10.00 |
| GPT-4o-mini (Azure OpenAI) | $0.15 | $0.60 |
| Self-hosted Mistral 7B (1x T4) | ~$0.11 | ~$0.11 |
| Self-hosted Llama 3.3 70B (2x A100) | ~$1.00 | ~$1.00 |
Self-hosted cost per 1M tokens = GPU $/hr ÷ (throughput tok/s × 3,600) × 1,000,000. Throughput is the key variable: the figures above assume ~3,000 tok/s on an NC16as_T4_v3 ($1.20/hr) and ~2,000 tok/s on an NC48ads_A100_v4 ($7.34/hr); measure your own before trusting any per-token number.
The break-even formula:
```
break_even (req/day) = fixed_daily_overhead
                       ─────────────────────────────────────────
                       api_cost_per_req − selfhost_cost_per_req

where:
  fixed_daily_overhead  = system node pool $/day = $0.37/hr × 24 = $8.88/day
                          ← vLLM Standalone only; the GPU node deprovisions at idle.
                          For KAITO, add the GPU node: $8.88 + $1.20 × 24 = $37.68/day
  api_cost_per_req      = (input_tokens × $0.15 + output_tokens × $0.60) / 1,000,000
  selfhost_cost_per_req = total_tokens × gpu_$/hr / (throughput_tok_s × 3,600)
```
Worked example — Mistral 7B vs GPT-4o-mini, 500 input + 300 output tokens avg, 3,000 tok/s:
```
api_cost_per_req      = (500 × $0.15 + 300 × $0.60) / 1,000,000 = $0.000255 / request
selfhost_cost_per_req = 800 × $1.20 / (3,000 × 3,600)           = $0.000089 / request

vLLM Standalone (GPU deprovisioned at idle):
  break_even = $8.88 / ($0.000255 − $0.000089)  ≈  53,500 requests / day

KAITO (GPU node always running while the Workspace exists):
  break_even = $37.68 / ($0.000255 − $0.000089) ≈ 227,000 requests / day
```
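The same arithmetic in a few lines of Python: a minimal sketch using the assumptions from the worked example above (GPT-4o-mini prices, $1.20/hr GPU, 3,000 tok/s), not measured values.

```python
def break_even_req_per_day(fixed_daily_overhead, input_tokens, output_tokens,
                           api_in_per_1m, api_out_per_1m,
                           gpu_per_hr, throughput_tok_s):
    """Requests/day above which self-hosting is cheaper than the API."""
    api_cost = (input_tokens * api_in_per_1m + output_tokens * api_out_per_1m) / 1_000_000
    selfhost_cost = (input_tokens + output_tokens) * gpu_per_hr / (throughput_tok_s * 3_600)
    if selfhost_cost >= api_cost:
        return float("inf")  # self-host cost per token exceeds the API's: never wins on cost
    return fixed_daily_overhead / (api_cost - selfhost_cost)

# Mistral 7B on NC16as_T4_v3 vs GPT-4o-mini, 500 input + 300 output tokens:
print(break_even_req_per_day(8.88,  500, 300, 0.15, 0.60, 1.20, 3000))   # ≈ 53,500  (vLLM Standalone)
print(break_even_req_per_day(37.68, 500, 300, 0.15, 0.60, 1.20, 3000))   # ≈ 227,000 (KAITO)
```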
Cost curves — vLLM Standalone starts at ~$9/day (system nodes only) and rises slowly; API starts at $0 and rises steeply. KAITO starts at ~$38/day regardless of traffic:

Deployment model determines your cost floor. vLLM Standalone achieves true GPU billing scale-to-zero via NAP — you pay ~$9/day at idle. KAITO sets do-not-disrupt: true on GPU nodes, preventing NAP consolidation while the Workspace CRD exists. Your cost floor with KAITO is ~$38/day regardless of traffic. Use KAITO for workloads with consistent demand; use vLLM Standalone for bursty or dev/test workloads where idle periods are significant.
What shifts the break-even:
| Factor | Direction | Why |
|---|---|---|
| Higher GPU throughput | Break-even drops | Cheaper self-hosted cost per token |
| Output-heavy requests | Break-even drops | API charges 4× more for output than input |
| More expensive API tier (GPT-4o vs mini) | Break-even drops sharply | Larger cost gap per request |
| KAITO deployment (node always-on) | Break-even rises to ~227K req/day | Fixed daily cost jumps from $8.88 to $37.68 |
| Low throughput (<~1,050 tok/s in this example) | No break-even — self-host never wins | Below it, self-host cost per token exceeds the API’s, so no volume recovers the fixed overhead |
Critical caveat: the cost advantage only materializes when the GPU is well-utilized and the GPU node deprovisions during idle (vLLM Standalone). For KAITO, the node runs continuously; the savings come from multi-model sharing and operational simplicity, not idle cost elimination.
KAITO and scale-to-zero — an important nuance: KAITO sets karpenter.sh/do-not-disrupt: "true" on every NodeClaim it creates (nodeclaim.go:151). This blocks NAP consolidation — the GPU node stays running as long as the Workspace CRD exists, even when all replicas are scaled to zero by KEDA. KAITO’s official KEDA integration (docs) scales inference pods only and uses minReplicaCount: 1 in all examples. Community request #306 tracks GPU node scale-to-zero — it has no implementation commitment as of 2026.
do-not-disrupt only blocks voluntary disruption. When the Workspace is deleted, KAITO’s GC finalizer deletes the NodeClaim directly (workspace_gc_finalizer.go), which bypasses the annotation and terminates the node. But this is a slow teardown path (6-8 min to redeploy), not the KEDA replica-scale path.
For true GPU billing scale-to-zero: use vLLM standalone (vllm-standalone.yaml) instead of KAITO. Without the do-not-disrupt annotation, NAP deprovisions the GPU node when KEDA scales replicas to zero. The KAITO model manifests in this repo remain valid for always-on or near-always-on workloads.
- Bursty or unpredictable traffic — KEDA scales from zero replicas when demand arrives, and NAP provisions GPU nodes automatically. No pre-provisioned capacity sitting idle between spikes.
- Multiple models — running Mistral for customer support and Llama for internal agents on the same cluster means one KAITO Workspace CRD per model, sharing the same system node pool. On VMs it’s manual port juggling.
- Customization — fine-tune on your domain data with KAITO’s QLoRA support (a single kubectl apply). Fine-tuning GPT-4o is restricted, more expensive, and the result stays on Microsoft’s infrastructure.
- Latency control — dedicated GPU, predictable P95 TTFT, direct control over serving parameters. Cloud API TTFT spikes during tenant peak hours.
- No vendor lock-in — model versions get deprecated on the API provider’s schedule. With open weights, you pin the version you tested. It runs forever.
Quality context: As of 2026, Llama 3.3 70B scores 86.0% on MMLU (0-shot, CoT) per the Meta model card, vs GPT-4o’s 88.7% on the same variant per OpenAI’s Hello GPT-4o. For most enterprise tasks, the gap between open-source and proprietary models has effectively closed. A fine-tuned smaller model often outperforms a larger general-purpose one on your specific domain.
VM vs AKS: when the complexity pays off
| Dimension | VM (single GPU) | AKS + KAITO + NAP + KEDA |
|---|---|---|
| Setup time | Minutes | ~10 min first time |
| GPU billing | 24/7, always on | Only while inferencing (scale-to-zero) |
| Multi-model | Manual port juggling | One KAITO Workspace CRD per model |
| Scaling to N replicas | Manual | KEDA + NAP handles it |
| Secrets / auth | .env files, SSH keys | Workload Identity — nothing stored |
| Cost at idle | Full GPU VM cost | ~$0 — node deprovisioned |
| RBAC | OS-level | Kubernetes RBAC + Azure RBAC |
Use a VM when: prototyping a single model, running long fine-tuning jobs that can’t tolerate interruption, or you want zero operational complexity.
Use AKS when: multiple models, bursty traffic, compliance requirements, or you already run other workloads on AKS and want to reuse the cluster.
How to Pick the Right Model
Run through these constraints in order. The first one that applies wins.
Step 1: What is your available VRAM?
VRAM is the hard constraint. Before evaluating quality, check what GPU tier you can access:
- 1x T4 (16 GB) → Phi-4 Mini, Phi-3 Mini, Mistral 7B (tight), Mistral 7B AWQ
- 1x A10 (24 GB) → Mistral 7B fp16 (comfortable), Phi-4 14B
- 1x A100 (80 GB) → Llama 3.1 8B, Llama 3.3 70B (quantized AWQ)
- 2x A100 (160 GB) → Llama 3.3 70B (full fp16 precision)
Step 2: What is your primary task?
| Task | Recommended model | Reason |
|---|---|---|
| Customer support / chat | Mistral 7B | Fast, cheap, strong instruction following |
| Code generation | Llama 3.3 70B | Best open-source code quality |
| Math / STEM / reasoning | Phi-4 Mini | Beats GPT-4o on MATH benchmark (80.4% vs 74.6%) |
| Long documents / RAG | Phi-3 Mini 128K or Llama 3.3 70B | 128K context window |
| Multi-turn agents / tool use | Llama 3.3 70B | Best open-source tool-use as of 2026 |
| Edge / batch classification | Phi-3 Mini or Llama 3.2 3B | Small, fast, cheap |
Step 3: License requirements
| License | Models | Restrictions |
|---|---|---|
| MIT | Phi family | None — zero ambiguity, no attribution required |
| Apache 2.0 | Mistral 7B / Mixtral | No meaningful restrictions |
| Llama Community License | Llama 3.x | OK for <700M MAU; cannot be used to build competing foundation models |
Step 4: Do you need fine-tuning?
If yes: Mistral 7B or Llama 3.1 8B — most tooling, KAITO QLoRA support, most community resources.
Model comparison
All models listed are available as open weights for self-hosting. Organized by minimum GPU required.
T4 tier — NC4as/NC8as/NC16as_T4_v3 ($0.53–$1.20/hr)
| Model | Params | MMLU | License | Azure SKU | Notes |
|---|---|---|---|---|---|
| Phi-4 Mini | 3.8B | 67.3% ⁵ | MIT | NC4as_T4_v3 | Math/reasoning at T4 budget |
| Phi-3 Mini 128K | 3.8B | 69.7% ⁵ | MIT | NC4as_T4_v3 | 128K context on T4; RAG |
| Gemma 3 4B | 4B | ~60% ¹ | Gemma ToU | NC4as_T4_v3 | Native text+image multimodal |
| Mistral 7B AWQ | 7B | 60.1% ² | Apache 2.0 | NC4as_T4_v3 | High-volume chat; fp16 is too tight for T4 |
| Qwen2.5-7B AWQ | 7B | 74.2% ⁵ | Apache 2.0 | NC8as_T4_v3 | Highest MMLU in T4 tier; 29 languages; KAITO-supported |
| DeepSeek R1 Distill 7B AWQ | 7B | ~57% ¹ | MIT | NC8as_T4_v3 | Chain-of-thought reasoning on T4; beats GPT-4o on MATH-500 |
Single A100 80GB — NC24ads_A100_v4 ($3.67/hr)
| Model | Params | MMLU | License | Notes |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | 73.0% ³ | Llama Community | General purpose; strong fine-tuning ecosystem |
| Gemma 3 12B | 12B | ~74% ¹ | Gemma ToU | Multimodal; strong multilingual |
| Qwen2.5-14B | 14B | 79.7% ⁵ | Apache 2.0 | Best mid-tier MMLU; 128K context |
| DeepSeek R1 Distill 32B AWQ | 32B | ~78% ¹ | MIT | Reasoning beats o1-mini; ~24GB at AWQ int4 |
| Gemma 3 27B | 27B | 78.6% ⁵ | Gemma ToU | Chatbot Arena Elo 1338 — outranks models 10× its size |
| Mistral Small 3.1/3.2 | 24B | ~80.6% | Apache 2.0 | 3× throughput vs Llama 70B; 128K context; optional vision |
Dual A100 — NC48ads_A100_v4 ($7.34/hr)
| Model | Params | MMLU | License | Notes |
|---|---|---|---|---|
| Llama 3.3 70B | 70B | 86.0% ³ | Llama Community | Best open tool-use and agents |
| Qwen2.5-72B | 72B | 86.1% ⁵ | Tongyi Qianwen ⁴ | Stronger math/code than Llama 70B; multilingual |
H100 cluster — ND96isr_H100_v5 (8× H100 SXM 80GB)
| Model | Params (total / active) | MMLU | License | Notes |
|---|---|---|---|---|
| Llama 4 Scout | 109B / 17B MoE | ~79.6% | Llama Community | 10M context; multimodal; single H100 at int4 |
| DeepSeek R1 | 671B / 37B MoE | 90.8% | MIT | Best open reasoning model; 8× H100 at FP8 |
| Kimi K2 / K2.5 | 1T / 32B MoE | 89.5% | Mod. MIT | Top open coding/agents; ~500GB int4; 4–8× H100 |
¹ Approximate — not officially published as standalone MMLU for these variants.
² Mistral 7B v0.3 has no separately published MMLU score; v0.3 changed only the tokenizer. Figure is from the original v0.1 paper.
³ Instruct model, 0-shot CoT. Base model 5-shot: Llama 3.1 8B = 66.7%; Llama 3.3 70B = 79.3%.
⁴ Tongyi Qianwen license: commercial use permitted with attribution; no meaningful restrictions for standard enterprise deployment.
⁵ 5-shot evaluation unless noted otherwise.


Recommended starting sequence
- Qwen2.5-7B or Phi-4 Mini — validate your pipeline on T4 ($0.53–0.75/hr)
- Mistral Small 3.1 or Qwen2.5-14B — validate quality at the mid-tier (single A100)
- Llama 3.3 70B or Qwen2.5-72B — set your quality ceiling
- DeepSeek R1 — if reasoning/math is critical, benchmark this before deciding you need a proprietary model
- Azure OpenAI GPT-4o or Claude — if none of the above meets your bar, now you have a concrete comparison point
How to Size the Stack
Getting this order right is critical. Pod config depends on the node. Replica count depends on pod config. Start at node selection.
Step 0: Measure your workload first
Every sizing decision downstream depends on two numbers: p95 prompt tokens and p95 completion tokens. Measure them separately — they matter differently. Prompt tokens drive KV cache prefill and max-model-len; completion tokens drive throughput and generation time.
If you’re already calling Azure OpenAI or the OpenAI API:
Every response includes token counts in the usage field. Log them and compute p95:
```python
import numpy as np

# Collect usage objects from your API response logs
# e.g. {"prompt_tokens": 512, "completion_tokens": 287}
samples = [...]  # your logged usage objects

prompt = [s["prompt_tokens"] for s in samples]
completion = [s["completion_tokens"] for s in samples]

print(f"p95 prompt:     {np.percentile(prompt, 95):.0f} tokens")
print(f"p95 completion: {np.percentile(completion, 95):.0f} tokens")
print(f"p95 total:      {np.percentile([p + c for p, c in zip(prompt, completion)], 95):.0f} tokens")
```
Token counts are also in Azure Monitor → Metrics → Azure OpenAI → Token Transaction, exportable as CSV.
If you’re already running vLLM:
Query the built-in Prometheus histograms:
```promql
# p95 prompt tokens (PromQL for Azure Managed Prometheus)
histogram_quantile(0.95, rate(vllm:request_prompt_tokens_bucket[1h]))

# p95 completion tokens
histogram_quantile(0.95, rate(vllm:request_generation_tokens_bucket[1h]))
```
Or directly from the metrics endpoint:
```bash
kubectl port-forward svc/<vllm-service> 8000:8000 -n inference &
curl -s http://localhost:8000/metrics | grep -E 'request_prompt_tokens|request_generation_tokens'
```
If you’re starting from scratch:
Count tokens on 200–500 representative real prompts using the target model’s tokenizer:
```python
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

# Assemble prompts exactly as your app would send them
# (system prompt + user message + any few-shot examples)
prompts = ["You are a helpful assistant.\n\nUser: <real example>", ...]

counts = [len(tokenizer.encode(p)) for p in prompts]
print(f"p50: {np.percentile(counts, 50):.0f}  p95: {np.percentile(counts, 95):.0f}  max: {max(counts)}")
```
For completion tokens, run 100 real requests against a pilot deployment and measure output length. Completion length varies by model and instruction phrasing — you cannot reliably estimate it without running the model.
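A sketch of that pilot measurement against the OpenAI-compatible endpoint; the base URL assumes the port-forward shown earlier, and the model name must match whatever your deployment actually serves:

```python
import numpy as np
from openai import OpenAI  # pip install openai — vLLM speaks the same protocol

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

real_prompts = [...]  # 100 representative production prompts
completion_tokens = []
for p in real_prompts:
    resp = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.3",  # must match the served model
        messages=[{"role": "user", "content": p}],
    )
    completion_tokens.append(resp.usage.completion_tokens)

print(f"p95 completion: {np.percentile(completion_tokens, 95):.0f} tokens")
```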
Starting estimates if you have no data yet:
| Use case | p95 prompt | p95 completion |
|---|---|---|
| Customer support chat | 400–800 | 150–300 |
| RAG (retrieval + question) | 1,500–4,000 | 200–500 |
| Code generation | 500–2,000 | 500–2,000 |
| Document summarization | 4,000–32,000 | 300–800 |
| Multi-turn agents with tool calls | 2,000–8,000 | 500–2,000 |
Validate these estimates before locking in GPU tier and max-model-len. A RAG workload sized on customer-support assumptions will OOM under real traffic.
The governing equation
Total VRAM required = Weights + KV Cache + Runtime Overhead (10–20%)
Step 1: Select the GPU node (VRAM calculation)
1a. Weights memory
Model weights load entirely into VRAM before the first token is generated.
```
Weights (GB) = Parameter count (billions) × bytes per parameter

Precision     Bytes/param   Notes
────────────────────────────────────────────────────────────────
fp16          2.0           Default. Works on T4, V100, A100
bfloat16      2.0           Preferred on A100/H100 (better range)
int8          1.0           ~0–2% quality loss
int4 (AWQ)    0.5           ~1–3% quality loss, needs pre-quantized checkpoint
```
| Model | Params | fp16 / bf16 | int8 | int4 (AWQ) |
|---|---|---|---|---|
| Phi-4 Mini | 3.8B | 7.6 GB | 3.8 GB | 1.9 GB |
| Gemma 3 4B | 4B | 8.0 GB | 4.0 GB | 2.0 GB |
| Mistral 7B | 7.3B | 14.6 GB | 7.3 GB | 3.7 GB |
| Qwen2.5-7B | 7B | 14.0 GB | 7.0 GB | 3.5 GB |
| Llama 3.1 8B | 8B | 16.0 GB | 8.0 GB | 4.0 GB |
| Gemma 3 12B | 12B | 24.0 GB | 12.0 GB | 6.0 GB |
| Qwen2.5-14B | 14B | 28.0 GB | 14.0 GB | 7.0 GB |
| Mistral Small 3.1 | 24B | 48.0 GB | 24.0 GB | 12.0 GB |
| Gemma 3 27B | 27B | 54.0 GB | 27.0 GB | 13.5 GB |
| DeepSeek R1 Distill 32B | 32B | 64.0 GB | 32.0 GB | 16.0 GB |
| Llama 3.3 70B | 70.6B | 141 GB | 70.6 GB | 35.3 GB |
| Qwen2.5-72B | 72B | 144 GB | 72.0 GB | 36.0 GB |
| DeepSeek R1 / Kimi K2 | 671B / 1T | impractical | impractical | ~335 GB / ~500 GB |
Rule: If weights occupy more than 70% of available VRAM, go up one GPU tier. The remaining 30% is for KV cache + overhead. At 90%+ on weights alone, you will OOM under any real concurrency.
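The 70% rule is easy to script. A minimal sketch using the bytes-per-parameter table above:

```python
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(params_b: float, precision: str) -> float:
    return params_b * BYTES_PER_PARAM[precision]

def passes_70_percent_rule(params_b: float, precision: str, vram_gb: float) -> bool:
    """Weights must leave >= 30% of VRAM for KV cache + runtime overhead."""
    return weights_gb(params_b, precision) <= 0.70 * vram_gb

print(passes_70_percent_rule(7.3, "fp16", 16))  # Mistral 7B fp16 on T4 -> False (14.6 > 11.2 GB)
print(passes_70_percent_rule(7.3, "int4", 16))  # Mistral 7B AWQ on T4  -> True  (3.7 GB)
```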
1b. KV Cache memory
KV cache is the part most people underestimate. It grows with the number of concurrent requests, the sequence length, and the model’s attention structure. It can exceed weights memory under load.
Simplified rule of thumb — KV cache per concurrent request per 1K context tokens:
| Model | KV per request per 1K context tokens (fp16) |
|---|---|
| Phi-4 Mini (GQA) | ~0.12 GB |
| Mistral 7B (GQA) | ~0.12 GB |
| Llama 3.1 8B (GQA) | ~0.12 GB |
| Llama 3.3 70B (GQA) | ~0.31 GB |
Example — Mistral 7B, 32 concurrent requests, 4K context:
32 requests × 4 (1K blocks) × 0.12 GB ≈ 16 GB ← as much as a T4’s entire VRAM, before weights
Example — Mistral 7B, 32 concurrent requests, 32K context:
32 × 32 × 0.12 GB ≈ 125 GB ← dominates even an A100 80 GB
Long context (32K+) with high concurrency is where KV cache dominates. Use --kv-cache-dtype fp8 to halve it on A100/H100.
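The rule-of-thumb table comes from the standard KV-cache formula: 2 (one K and one V tensor) × layers × KV heads × head dimension × bytes per value, per token. A sketch you can point at any model’s config.json; the Mistral 7B numbers below are its published architecture values, so verify them against your checkpoint:

```python
def kv_gb_per_1k_tokens(n_layers: int, n_kv_heads: int, head_dim: int,
                        bytes_per_value: int = 2) -> float:
    """KV cache per request per 1,000 context tokens (fp16 by default)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # K + V
    return per_token * 1_000 / 1024**3

kv = kv_gb_per_1k_tokens(n_layers=32, n_kv_heads=8, head_dim=128)  # Mistral 7B v0.3 (GQA)
print(f"{kv:.2f} GB / request / 1K tokens")        # ≈ 0.12 GB
print(f"{32 * 4 * kv:.0f} GB at 32 req × 4K ctx")  # ≈ 16 GB, matching the 4K example above
```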
1c. GPU selection table
Selection rule: Usable VRAM > Weights × 1.3
| Model + dtype | Weights | Min usable VRAM | GPU tier | Azure SKU | $/hr |
|---|---|---|---|---|---|
| Phi-4 Mini (fp16) | 7.6 GB | 9.9 GB | T4 16 GB | NC4as_T4_v3 | $0.53 |
| Mistral 7B / Qwen2.5-7B (AWQ) | 3.5–3.7 GB | 4.6–4.8 GB | T4 16 GB | NC4as_T4_v3 | $0.53 |
| Mistral 7B (fp16) | 14.6 GB | 19.0 GB | T4 too tight — use AWQ | NC16as_T4_v3 | $1.20 |
| Qwen2.5-14B (fp16) | 28.0 GB | 36.4 GB | A100 80 GB | NC24ads_A100_v4 | $3.67 |
| Llama 3.1 8B (fp16) | 16.0 GB | 20.8 GB | A100 80 GB | NC24ads_A100_v4 | $3.67 |
| Mistral Small 3.1 (fp16) | 48.0 GB | 62.4 GB | A100 80 GB | NC24ads_A100_v4 | $3.67 |
| Gemma 3 27B (fp16) | 54.0 GB | 70.2 GB | A100 80 GB | NC24ads_A100_v4 | $3.67 |
| DeepSeek R1 Distill 32B (AWQ) | 16.0 GB | 20.8 GB | A100 80 GB | NC24ads_A100_v4 | $3.67 |
| Llama 3.3 70B (fp16) | 141 GB | 183 GB | 2× A100 80 GB | NC48ads_A100_v4 | $7.34 |
| Qwen2.5-72B (fp16) | 144 GB | 187 GB | 2× A100 80 GB | NC48ads_A100_v4 | $7.34 |
| Llama 4 Scout (int4) | ~55 GB | ~72 GB | H100 80 GB | ND96isr_H100_v5 | ~$98 |
| DeepSeek R1 (FP8) | ~335 GB | ~436 GB | 8× H100 80 GB | ND96isr_H100_v5 | ~$98 |
| Kimi K2/K2.5 (int4) | ~500 GB | ~650 GB | 8× H100 80 GB | ND96isr_H100_v5 | ~$98 |
Mistral 7B on T4 warning: weights fill 14.6 GB of 16 GB — only ~1.4 GB left for KV cache. You must set max-model-len: 2048 and max-num-seqs: 16 or you will OOM, and even then only a few full-length sequences fit in cache at once. Mistral 7B int4 AWQ is strongly recommended on T4.
Step 2: Configure the vLLM pod (four parameters)
These four parameters interact — changing one affects the others. Set them in this order.
Parameter 1: gpu-memory-utilization
```yaml
gpu-memory-utilization: 0.90   # Good default
```
This controls how much total VRAM vLLM claims at startup for weights + KV cache combined. It does not mean 90% goes to weights — the model loads first, and whatever is left within this fraction becomes KV cache.
- Increase to 0.95 → more KV cache → higher concurrency or longer context
- Decrease to 0.85 → if you get random OOMKilled under moderate load
- Never use 1.0 → CUDA needs a reservation for kernels and activations
Parameter 2: max-model-len — your real throughput knob
This sets the ceiling on total tokens per request (input + output). It directly controls how much KV cache each request can consume.
The most common mistake: setting max-model-len: 131072 when your app sends 500-token prompts and expects 300-token responses. This wastes enormous KV cache reservation.
Recommended formula:
```
max-model-len = 2 × p95(actual prompt tokens + completion tokens)

Example:
  p95 prompt:     500 tokens
  p95 completion: 300 tokens
  → max-model-len: 2048   (not 128K just because the model supports it)
```
Effect of halving max-model-len — you roughly double concurrent capacity:
```
Mistral 7B fp16 on T4, remaining VRAM for KV ≈ 1.4 GB (~0.12 GB per 1K tokens per sequence):
  max-model-len: 4096 → ~3 full-length concurrent sequences
  max-model-len: 2048 → ~6
  max-model-len: 1024 → ~12
```
Parameter 3: max-num-seqs — concurrency ceiling
```yaml
max-num-seqs: 64   # Maximum concurrent sequences in the vLLM scheduler
```
When this ceiling is hit, new requests queue and wait. Starting formula:
```
max-num-seqs = floor(available_kv_vram_GB / kv_per_seq_at_max_model_len_GB)

Example — Phi-4 Mini on T4:
  Weights:           7.6 GB
  Available for KV:  16 × 0.90 − 7.6 = 6.8 GB
  KV per seq @ 4K:   4 × 0.12 GB ≈ 0.48 GB
  → floor(6.8 / 0.48) ≈ 14 full-length sequences guaranteed
  → Start at 32–64: PagedAttention allocates per actual token, so
    shorter typical requests raise effective concurrency
```
In practice: start at 32–64, run a load test, watch vllm:gpu_cache_usage_perc in Prometheus. Increase until it hits 85–90% under expected peak load.
Parameter 4: dtype
dtype: "float16" # T4, V100 — no native bfloat16dtype: "bfloat16" # A100, H100 — better numerical stability, same memory
Reference configs by GPU tier
```yaml
# T4 16 GB — Phi-4 Mini (comfortable)
vllm:
  gpu-memory-utilization: 0.90
  max-model-len: 8192
  max-num-seqs: 64
  dtype: "float16"
  enable-prefix-caching: true

# T4 16 GB — Mistral 7B fp16 (tight — reduce if OOM)
vllm:
  gpu-memory-utilization: 0.92
  max-model-len: 2048   # Conservative on T4 for Mistral fp16
  max-num-seqs: 16
  dtype: "float16"

# T4 16 GB — Mistral 7B int4 AWQ (recommended on T4)
vllm:
  gpu-memory-utilization: 0.90
  max-model-len: 8192
  max-num-seqs: 64
  dtype: "float16"
  quantization: "awq"

# V100 16 GB — Llama 3.1 8B
vllm:
  gpu-memory-utilization: 0.95
  max-model-len: 4096
  max-num-seqs: 32
  dtype: "float16"
  enable-prefix-caching: true

# A100 80 GB — Llama 3.1 8B (lots of headroom)
vllm:
  gpu-memory-utilization: 0.90
  max-model-len: 32768
  max-num-seqs: 256
  dtype: "bfloat16"
  enable-prefix-caching: true
  enable-chunked-prefill: true

# 2x A100 80 GB — Llama 3.3 70B
vllm:
  gpu-memory-utilization: 0.90
  max-model-len: 8192
  max-num-seqs: 64
  dtype: "bfloat16"
  tensor-parallel-size: 2
  enable-prefix-caching: true
```
Additional flags worth knowing:
```yaml
enable-prefix-caching: true
# Reuses KV cache for identical prompt prefixes across requests.
# Major win for chatbots (same system prompt) and RAG (same retrieval context).
# No cost — enable by default.

enable-chunked-prefill: true
# Processes long prompts in chunks instead of one shot.
# Prevents a long prefill from starving concurrent short requests.
# Use when you mix short and long prompts.

kv-cache-dtype: "fp8"
# Halves KV cache memory vs fp16.
# Allows 2x more concurrent sequences for the same VRAM.
# Requires A100/H100. ~0.5% quality degradation.

swap-space: 4
# CPU RAM (GB) vLLM can use to offload KV cache blocks when VRAM is full.
# Acts as a spillover buffer. Set to 4–16 GB on nodes with large system RAM.
```
Step 3: Pod resource requests
```yaml
resources:
  requests:
    nvidia.com/gpu: "1"   # Always 1 per vLLM pod.
                          # vLLM owns the GPU exclusively — never share.
    cpu: "4"              # Tokenizer + HTTP server: 2–4 cores sufficient.
                          # GPU is the bottleneck, not CPU.
    memory: "24Gi"        # Weights load through CPU RAM first, then copy to VRAM.
                          # Set to ~1.5× weights size.
  limits:
    nvidia.com/gpu: "1"   # Prevents accidental multi-GPU scheduling.
    cpu: "8"              # Allow burst for prefill spikes.
    memory: "32Gi"        # Headroom for Python runtime + buffers.
```
Why CPU and RAM are not your bottleneck: vLLM’s hot path (attention, sampling) runs entirely on GPU. The CPU handles HTTP parsing, tokenization, and scheduling — lightweight tasks that rarely exceed 2–3 cores even at high throughput. Over-provisioning CPU doesn’t help. Over-provisioning GPU does.
Step 4: Replica count and KEDA alignment
A single vLLM pod owns one GPU and processes one batch at a time. Horizontal scaling (more pods, more GPU nodes via NAP) is the primary way to increase total throughput.
```
required_replicas = ceil(peak_concurrent_users / max-num-seqs)

Example — Mistral 7B on T4, max-num-seqs = 32:
  Peak concurrent users: 200
  → ceil(200 / 32) = 7 replicas = 7 GPU nodes (NC16as_T4_v3)
  → Cost at peak: 7 × $1.20/hr = $8.40/hr
  → Cost at idle: $0 — NAP deprovisions all 7 nodes
```
For bursty workloads, size for p95 concurrency, not the all-time peak. Let KEDA handle spikes up to maxReplicaCount.
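The replica arithmetic as a sketch, using the numbers from the example above:

```python
import math

def size_replicas(peak_concurrent_users: int, max_num_seqs: int,
                  gpu_per_hr: float) -> tuple[int, float]:
    """Replicas needed at peak and the corresponding peak GPU cost."""
    replicas = math.ceil(peak_concurrent_users / max_num_seqs)
    return replicas, replicas * gpu_per_hr

replicas, peak_cost = size_replicas(200, 32, 1.20)
print(f"{replicas} replicas -> ${peak_cost:.2f}/hr at peak, ~$0 at idle")  # 7 -> $8.40/hr
```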
KEDA ScaledObject alignment — your max-num-seqs should inform your KEDA threshold:
```yaml
triggers:
  - type: prometheus
    metadata:
      query: avg(vllm:num_requests_waiting{namespace="inference"})
      threshold: "5"             # Add a replica when avg waiting > 5
      activationThreshold: "1"
minReplicaCount: 1               # Keep 1 warm to avoid cold-start on first request
maxReplicaCount: 7               # ceil(200 peak users / 32 max-num-seqs)
```
Scaling decision guide:
| Signal | Action |
|---|---|
| vllm:num_requests_waiting > 0 consistently | Add replicas |
| vllm:gpu_cache_usage_perc > 90% | Reduce max-num-seqs OR add replicas |
| GPU utilization < 40% at peak | Pod/GPU oversized — go down a tier |
| OOMKilled | Reduce max-num-seqs or max-model-len |
| TTFT > SLO at low concurrency | GPU too slow — go up one tier |
Quick reference: all parameters together
| Model | GPU | max-model-len | max-num-seqs | dtype | tensor-parallel |
|---|---|---|---|---|---|
| Phi-4 Mini | T4 16GB | 8192 | 64 | float16 | 1 |
| Phi-3 Mini 128K | T4 16GB | 8192 | 32 | float16 | 1 |
| Mistral 7B (fp16) | T4 16GB | 2048 | 16 | float16 | 1 |
| Mistral 7B (AWQ) | T4 16GB | 8192 | 64 | float16 | 1 |
| Llama 3.1 8B | V100 16GB | 4096 | 32 | float16 | 1 |
| Llama 3.3 70B | 2x A100 80GB | 8192 | 64 | bfloat16 | 2 |
Cost Awareness
| Component | When billed | Approx. cost |
|---|---|---|
| System node pool (D4ds_v5 ×2) | Always | ~$0.37/hr total |
| NC4as_T4_v3 (Phi-4 Mini) | Only when NAP provisions | ~$0.53/hr |
| NC16as_T4_v3 (Mistral 7B) | Only when NAP provisions | ~$1.20/hr |
| NC6s_v3 (Llama 3 8B) | Only when NAP provisions | ~$0.90/hr |
| NC24ads_A100_v4 (Llama 3 70B) | Only when NAP provisions | ~$3.67/hr per node |
NAP deprovisions GPU nodes after 2 minutes of idle. A dev/test workflow running occasional requests pays for GPU time only while actively inferencing — often under $5/day.
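A quick estimator for that table, assuming the GPU node bills only while provisioned and the system pool bills 24/7:

```python
def daily_cost(gpu_per_hr: float, active_gpu_hours: float,
               system_pool_per_hr: float = 0.37) -> float:
    """Approximate $/day: system nodes always on, GPU only while NAP keeps the node."""
    return system_pool_per_hr * 24 + gpu_per_hr * active_gpu_hours

print(f"${daily_cost(0.53, 4):.2f}/day")    # Phi-4 Mini on T4, 4 active hrs -> $11.00
print(f"${daily_cost(1.20, 24):.2f}/day")   # Mistral 7B always-on (KAITO)   -> $37.68
```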
Architecture Overview

Component Deep Dives
KEDA — Kubernetes Event-Driven Autoscaling
What problem it solves: Standard HPA scales on CPU/memory — meaningless for LLM inference where GPUs are the bottleneck and requests arrive unpredictably. KEDA watches external event sources and scales Deployments based on real demand signals.
The scale-to-zero trick: HPA requires at least one running pod to collect metrics. KEDA bypasses this by monitoring event sources directly from its operator — no running pod needed. When demand arrives, it sets replicas from 0 → 1 before HPA ever gets involved.
Three trigger modes in this lab:
| Trigger | File | Best For |
|---|---|---|
| HTTP Add-on | keda/1-http-scaledobject.yaml | Synchronous inference API; buffers requests while pods cold-start |
| Service Bus Queue | keda/2-servicebus-scaledobject.yaml | Async batch inference; message durability; decoupled producers |
| Azure Managed Prometheus | keda/3-prometheus-scaledobject.yaml | React to GPU utilization or vLLM internal queue depth |
HTTP Add-on internals: The HTTP add-on installs an interceptor proxy (2 replicas) that sits in front of your Service. All traffic routes through it. When a request arrives and the target deployment has 0 replicas, the proxy holds the connection open, signals KEDA to scale up, and forwards the request once the pod is ready. This is transparent to the client — they just see extra latency on the first request.
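On the client side this means one thing: set a timeout that covers the worst-case cold start. A sketch with the OpenAI SDK; the endpoint and model name are placeholders for your deployment:

```python
from openai import OpenAI

# Worst case behind the interceptor: NAP node provision (3-5 min) + image pull
# + model load. A default 60s client timeout gives up long before that.
client = OpenAI(
    base_url="https://inference.example.com/v1",  # placeholder endpoint
    api_key="unused",
    timeout=600,
)

resp = client.chat.completions.create(
    model="phi-4-mini-instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```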
HTTP Add-on production caveats:
- Always-on cost: the 2 interceptor replicas run continuously regardless of inference traffic (~$0.35/hr on Standard_D2ds_v5). This is not included in scale-to-zero savings calculations and is the minimum cost floor for HTTP-triggered scaling.
- Cold-start timeout: the proxy has a finite wait timeout. If NAP provisioning + container pull + model load exceeds it, the client receives a 503 even if the pod eventually becomes ready. Set scaledownPeriod high enough that the GPU node stays warm between requests during active usage periods.
- Long-generation workloads: for models that generate responses taking >60s (e.g. Llama 3.3 70B on complex prompts), use the Service Bus trigger instead — it provides durable buffering with no proxy timeout constraint.
Key tuning parameters:
```yaml
cooldownPeriod: 120       # Seconds of idle before scaling to zero.
                          # For LLMs: set higher than your longest generation.
                          # Killing a pod mid-generation loses the response.
pollingInterval: 15       # How often KEDA queries the trigger source.
                          # Lower = faster reaction, more API calls to Azure.
activationThreshold: 1    # Queue depth that triggers scale from 0 → 1.
                          # Keep at 1 for interactive use cases.
threshold: 5              # Target metric value per replica.
                          # "Add a replica per 5 queued messages" or
                          # "add a replica when GPU util > 70%"
```
Authentication (no secrets stored): KEDA’s TriggerAuthentication uses azure-workload provider. The KEDA operator ServiceAccount is federated with an Azure managed identity (via Terraform). It exchanges its OIDC token for a scoped AAD token at query time. Connection strings never touch etcd.
NAP — Node Auto Provisioning (Karpenter on AKS)
Preview status: NAP is in public preview as of early 2026. Microsoft’s support boundary for preview features differs from GA — it is not covered by an SLA and breaking changes may occur. Verify current status at aka.ms/aks-nap before using in production. GPU SKU availability varies significantly by region — NC4as_T4_v3 is broadly available but H100 SKUs are quota-limited in most regions. Request quota at aka.ms/AzureGPUQuota before designing for specific GPU families.
Spot GPU instances: Azure spot pricing reduces GPU costs by ~75–80% (NC4as_T4_v3: ~$0.10/hr vs $0.53/hr on-demand). The lab includes a spot NodePool (gpu-inference-spot) for async/batch workloads. Do not use spot for synchronous HTTP inference or KAITO workloads. KAITO’s do-not-disrupt annotation blocks Karpenter’s voluntary consolidation but does not block Azure spot eviction (an involuntary interruption) — the GPU node will still be evicted mid-inference. Spot is safe for Service Bus queue workers where jobs requeue on failure. See manifests/nap/gpu-nodepool.yaml for the two-NodePool setup (on-demand primary, spot secondary).
What problem it solves: Classic AKS cluster autoscaler requires pre-created node pools with fixed VM sizes. If you don’t have a GPU node pool, GPU-requesting pods stay Pending forever. NAP replaces this with a Karpenter-based controller that analyzes each pending pod’s resource requirements and dynamically provisions the optimal VM.
How selection works:
```
Pending pod requests:
  nvidia.com/gpu: 1
  memory: 16Gi
  cpu: 4

NAP evaluates NodePool requirements:
  sku-family: NC
  sku-name: [NC4as_T4_v3, NC8as_T4_v3, NC16as_T4_v3, NC6s_v3, NC24ads_A100_v4]
  capacity-type: on-demand

NAP picks the cheapest SKU that fits all requests:
  → Standard_NC4as_T4_v3 (1x T4, 28GiB RAM, 4 vCPU) wins
  → VM provisions, joins cluster, pod schedules
```
GPU node lifecycle:
```
Pod pending                  → NAP provisions node (3–5 min)
Pod running                  → model loads → inference starts
Pod completed / scaled to 0  → node idle
consolidateAfter: 2m         → NAP deprovisions node → GPU billing stops
```
Important: KAITO sets karpenter.sh/do-not-disrupt: "true" on every NodeClaim it creates (source). This blocks Karpenter’s voluntary disruption (consolidation) — the GPU node stays alive as long as the Workspace CRD exists, even when replicas are scaled to zero by KEDA. do-not-disrupt only blocks voluntary disruption; when the Workspace is deleted, KAITO’s GC finalizer deletes the NodeClaim directly (workspace_gc_finalizer.go), which bypasses the annotation and terminates the node. KAITO’s official KEDA integration (v0.8.0+) scales pods only and uses minReplicaCount: 1 in all examples — GPU node scale-to-zero is not supported (issue #306). See KAITO vs vLLM Standalone below.
Key CRDs in this lab:
- NodePool (Karpenter API) — constraints: GPU SKU families, capacity type (spot vs on-demand), architecture, taints
- AKSNodeClass (Azure extension) — VNet/subnet ID, OS disk size, image family
GPU taint/toleration pattern:
```yaml
# NodePool applies this taint to every GPU node it provisions:
taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule

# Pods must declare this toleration to land on GPU nodes:
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
```
This ensures CPU workloads never accidentally schedule onto expensive GPU VMs.
Cost guard:
```yaml
limits:
  nvidia.com/gpu: "8"   # Hard cap: NAP won't provision beyond 8 GPUs total.
                        # Without this, a misconfigured workload can exhaust
                        # your entire Azure GPU quota.
```
KAITO — Kubernetes AI Toolchain Operator
What problem it solves: Deploying an LLM on Kubernetes without KAITO requires: knowing the right GPU SKU, writing vLLM startup args, configuring tensor parallelism, managing GPU driver plugin DaemonSets, writing readiness probes tuned to 2-minute model load times, and more. KAITO wraps all of this into a single 15-line Workspace CRD.
What KAITO does when you kubectl apply a Workspace:
1. Reads the Workspace spec (instanceType, preset name)
2. Validates the GPU SKU has enough VRAM for the model
3. Creates a Deployment with correct GPU requests + tolerations
4. Creates a ConfigMap with vLLM startup arguments
5. Creates a ClusterIP Service named after the workspace
6. Monitors the Deployment → updates Workspace status conditions
NAP provisions the GPU node in parallel (step 3 triggers it).
NC6s_v3 (V100) is an older GPU generation being progressively retired from Azure regions. Verify availability in your target region before depending on it. If unavailable, NC8as_T4_v3 is the recommended alternative for Llama 3.1 8B with quantization (AWQ/GPTQ reduces VRAM requirement to ~10 GB).
Preset model matrix in this lab:
| KAITO Preset | File | Min GPU | Min VRAM | Approx GPU VM |
|---|---|---|---|---|
| phi-4-mini-instruct | workspace-phi4-mini.yaml | 1x T4 | 8 GB | NC4as_T4_v3 |
| phi-3-mini-128k-instruct | workspace-phi3-mini.yaml | 1x T4 | 10 GB | NC8as_T4_v3 |
| mistral-7b-instruct | workspace-mistral-7b.yaml | 1x T4 | 14 GB | NC16as_T4_v3 |
| llama-3.1-8b-instruct | workspace-llama3-8b.yaml | 1x V100 | 16 GB | NC6s_v3 |
| llama-3.3-70b-instruct | workspace-llama3-70b.yaml | 2x A100 | 160 GB | 2x NC24ads_A100_v4 |
vLLM ConfigMap tuning: KAITO passes inference config via a ConfigMap referenced in the Workspace. Key vLLM parameters for LLM workloads:
```yaml
vllm:
  gpu-memory-utilization: 0.90   # Fraction of total VRAM vLLM claims for
                                 # weights + KV cache. Leave 10% margin.
  max-model-len: 4096            # Maximum sequence length (input + output).
                                 # Reduce to fit in VRAM if OOM.
  max-num-seqs: 64               # Max concurrent sequences in the scheduler.
                                 # Each sequence consumes KV cache memory.
  dtype: "float16"               # T4/V100: use float16. A100/H100: use bfloat16.
  enable-prefix-caching: true    # Cache KV for repeated system prompts.
                                 # Big win for chatbot workloads (same system prompt).
```
KAITO vs vLLM Standalone:
| | KAITO | vLLM Standalone |
|---|---|---|
| Model packaging | Pre-built MCR images — no HuggingFace token needed | Pull from HuggingFace or your own registry |
| GPU validation | Validates VRAM before scheduling | Fails at runtime (OOM) |
| Multi-node (70B+) | Handled automatically (Ray topology) | Manual Ray configuration |
| vLLM version control | Pinned to KAITO release | Any version |
| True GPU scale-to-zero | ✗ — do-not-disrupt pins the node | ✓ — NAP deprovisions freely |
| Cold start (node warm) | Fast — image cached on node | Fast — image cached on node |
Use KAITO for always-on or near-always-on workloads (minReplicaCount: 1), multi-node large models, or when you want preset GPU validation with minimal YAML.
Use vLLM Standalone (manifests/vllm/vllm-standalone.yaml) when true GPU billing scale-to-zero is required — bursty workloads, dev/lab environments, or any scenario where the GPU should deprovision during idle periods. Also use standalone for custom LoRA adapters, quantized (GGUF/AWQ) weights, or a vLLM version newer than what KAITO packages.
Checking workspace status:
```bash
kubectl get workspace -n inference
# NAME                  INSTANCE              RESOURCEREADY  INFERENCEREADY  WORKSPACEREADY
# workspace-phi4-mini   Standard_NC4as_T4_v3  True           True            True

kubectl describe workspace workspace-phi4-mini -n inference
# Look at the Conditions section for detailed status
```
vLLM — OpenAI-Compatible Inference Server
Why vLLM (not TGI, Ollama, etc.):
- PagedAttention: manages KV cache as virtual memory pages → higher throughput
- Continuous batching: processes multiple requests in parallel without waiting
- OpenAI API compatibility: drop-in replacement for GPT-4 clients (no SDK change)
- Tensor parallelism: split a model across multiple GPUs in one line (--tensor-parallel-size 2)
- Prefix caching: reuse KV cache for repeated system prompts (significant for chatbots)
OpenAI-compatible endpoints:
```
POST /v1/chat/completions — ChatGPT-style multi-turn conversation
POST /v1/completions      — Legacy text completion
GET  /v1/models           — Lists available models
GET  /health              — Readiness probe endpoint
GET  /metrics             — Prometheus metrics (queue depth, TTFT, throughput)
```
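Because the surface is OpenAI-compatible, measuring TTFT needs nothing beyond the standard SDK in streaming mode. A sketch assuming the port-forwarded service from earlier and a served Mistral model:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
ttft = None
stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
    stream=True,
)
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start  # time to first generated token
print(f"TTFT: {ttft:.2f}s")
```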
Cold-start latency breakdown:
| Phase | Duration | Notes |
|---|---|---|
| NAP VM provision | 3-5 min | Only if no GPU node available |
| Container pull | 1-2 min | vLLM image ~8GB; faster after first pull |
| Model download | 2-10 min | From HuggingFace; cached in PVC after first run |
| Model load to VRAM | 30-120s | Proportional to model size |
| vLLM ready | ~10s | After model loaded |
Use a PVC for model caching (see vllm-standalone.yaml). Without it, every pod restart re-downloads the full model. With it, cold start goes from 10+ minutes to under 2 minutes after the first run.
Workload Identity
Why not connection strings or Kubernetes Secrets: Secrets stored in etcd are base64-encoded (not encrypted) by default. Rotation requires a redeployment. If your etcd backup leaks, all secrets leak. Workload Identity eliminates the problem entirely.
The OIDC federation chain:
```
Kubernetes Pod
  ↓ ServiceAccount projected token (JWT, short-lived, in /var/run/secrets/)
Azure Workload Identity Webhook
  ↓ injects AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_FEDERATED_TOKEN_FILE
Azure AD
  ↓ validates OIDC token against the cluster's OIDC issuer URL
  ↓ checks subject matches the federated credential (system:serviceaccount:ns:sa)
  ↓ issues an AAD access token scoped to the requested resource
Azure Resource (Key Vault / Service Bus / Foundry)
  ↓ validates AAD token → grants access
```
Three identities in this lab:
| Identity | Used By | Permissions |
|---|---|---|
| kaito-identity | KAITO GPU provisioner | Contributor on AKS cluster (to provision nodes) |
| keda-identity | KEDA operator | Monitoring Data Reader (Prometheus), Service Bus Data Owner |
| workload-identity | Inference pods | Key Vault Secrets User, Service Bus Data Sender/Receiver |
Secrets Store CSI Driver: Mounts Key Vault secrets as files inside pods at /mnt/secrets/. Combined with the secretObjects block in SecretProviderClass, secrets are also mirrored as Kubernetes Secret objects (for workloads that read from env vars).
```bash
# Store your HuggingFace token in Key Vault (required for Llama 3):
az keyvault secret set \
  --vault-name <KEY_VAULT_NAME> \
  --name hf-token \
  --value "hf_xxxxxxxxxxxxxxxxxxxx"
```
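Pods that prefer the SDK over a CSI file mount can fetch the same secret directly through the federation chain. A sketch assuming the pod runs under the federated ServiceAccount (azure.workload.identity/use: "true"); the vault name is a placeholder:

```python
# pip install azure-identity azure-keyvault-secrets
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# The workload identity webhook injects AZURE_CLIENT_ID, AZURE_TENANT_ID and
# AZURE_FEDERATED_TOKEN_FILE; DefaultAzureCredential exchanges the projected
# ServiceAccount token for an AAD token. No connection string anywhere.
credential = DefaultAzureCredential()
client = SecretClient(vault_url="https://<KEY_VAULT_NAME>.vault.azure.net",
                      credential=credential)

hf_token = client.get_secret("hf-token").value  # never touches etcd or env vars
```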
Directory Structure
```
aks-ai-lab/
├── terraform/
│   ├── main.tf                  # AKS + NAP + KAITO + KEDA + Key Vault + Service Bus
│   ├── variables.tf
│   ├── outputs.tf
│   └── terraform.tfvars.example
├── manifests/
│   ├── nap/
│   │   └── gpu-nodepool.yaml    # Karpenter NodePool + AKSNodeClass for GPU nodes
│   ├── kaito/
│   │   ├── namespace.yaml
│   │   ├── workspace-phi4-mini.yaml    # Cheapest: T4 16GB
│   │   ├── workspace-phi3-mini.yaml    # 128K context: T4 16GB
│   │   ├── workspace-mistral-7b.yaml   # Balanced: T4 16GB
│   │   ├── workspace-llama3-8b.yaml    # Quality: V100 16GB
│   │   └── workspace-llama3-70b.yaml   # Premium: 2x A100 80GB
│   ├── keda/
│   │   ├── 1-http-scaledobject.yaml        # Scale on HTTP request concurrency
│   │   ├── 2-servicebus-scaledobject.yaml  # Scale on Service Bus queue depth
│   │   └── 3-prometheus-scaledobject.yaml  # Scale on GPU util / vLLM queue
│   ├── workload-identity/
│   │   ├── serviceaccount.yaml         # Federated SA for inference pods
│   │   ├── secret-provider-class.yaml  # Key Vault → pod file mounts
│   │   └── keda-trigger-auth.yaml      # KEDA → Azure auth (no secrets)
│   ├── vllm/
│   │   └── vllm-standalone.yaml        # Direct vLLM deployment (non-KAITO)
│   ├── ingress/
│   │   ├── 1-app-routing.yaml              # AKS App Routing add-on (NGINX) — lab/dev
│   │   ├── 2-app-gateway-containers.yaml   # Application Gateway for Containers — production
│   │   ├── 3-istio-gateway.yaml            # Istio ingress + VirtualService — production
│   │   └── 4-inference-extension.yaml      # Gateway API Inference Extension — multi-replica
│   └── monitoring/
│       └── dcgm-exporter.yaml   # NVIDIA GPU metrics DaemonSet
├── tests/
│   ├── TESTING.md                    # Test guide — what each test validates
│   ├── 00-run-all-tests.sh           # Run full test suite
│   ├── 01-test-endpoint.sh           # vLLM API surface validation
│   ├── 02-test-keda-scaling.sh       # Scale-up / scale-down lifecycle
│   ├── 03-test-nap-lifecycle.sh      # GPU node provision / deprovision
│   ├── 04-test-load.sh               # Throughput / concurrency benchmark
│   └── 05-test-workload-identity.sh  # OIDC → AAD → Key Vault chain
├── docs/
│   ├── sizing-guide.md    # Node / pod / replica sizing formulas
│   └── ingress-guide.md   # Ingress options, manifests, decision guide
└── scripts/
    ├── 00-prereqs.sh       # Tool versions, GPU quota, feature flag check
    ├── 01-deploy.sh        # terraform apply + helm installs + namespace setup
    ├── 02-deploy-model.sh  # kubectl apply a KAITO workspace + watch status
    └── 03-smoke-test.sh    # port-forward + OpenAI API test + KEDA status
```
Quickstart
Prerequisites
- Azure subscription with NC-series GPU quota (request at https://aka.ms/AzureGPUQuota)
- Tools: az, kubectl, helm, terraform, jq
1. Clone and configure
```bash
git clone <your-repo> aks-ai-lab
cd aks-ai-lab
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit terraform.tfvars — set subscription_id and location
```
2. Check prerequisites
```bash
chmod +x scripts/*.sh
./scripts/00-prereqs.sh
```
3. Deploy infrastructure
```bash
./scripts/01-deploy.sh
# Takes ~10 minutes. Creates AKS cluster with NAP, KAITO, KEDA add-ons,
# Key Vault, Service Bus, managed identities, federated credentials.
```
4. Store secrets in Key Vault
```bash
# Required for Llama 3 (gated model). Optional for Phi/Mistral.
az keyvault secret set --vault-name <KV_NAME> --name hf-token --value "hf_xxx"
az keyvault secret set --vault-name <KV_NAME> --name foundry-api-key --value "xxx"
```
5. Deploy a model
```bash
# Start with Phi-4 Mini — fastest and cheapest (T4 GPU, ~$0.50/hr)
./scripts/02-deploy-model.sh phi4-mini

# Or deploy directly:
kubectl apply -f manifests/kaito/workspace-phi4-mini.yaml

# Watch NAP provision the GPU node:
kubectl get nodes -w
# NAME                  STATUS   ROLES   AGE   VERSION
# aks-system-xxxxx      Ready    agent   10m   v1.31
# (after 3-5 min):
# aks-nc4ast4v3-xxxxx   Ready    agent   1m    v1.31   ← GPU node!
```
6. Apply KEDA scaling
```bash
# Update placeholders in the KEDA manifests first:
export SB_NS=$(terraform -chdir=terraform output -raw servicebus_namespace)
sed -i "s|<SERVICEBUS_NAMESPACE>|$SB_NS|g" manifests/keda/2-servicebus-scaledobject.yaml

export AMW=$(az monitor account list -g rg-aks-ai-lab --query '[0].metrics.prometheusQueryEndpoint' -o tsv)
sed -i "s|<AMW_ENDPOINT>|$AMW|g" manifests/keda/3-prometheus-scaledobject.yaml

kubectl apply -f manifests/keda/
```
7. Run smoke test
```bash
./scripts/03-smoke-test.sh phi4-mini
```
Useful Commands
```bash
# Watch workspace status
kubectl get workspace -n inference -w

# Check which GPU node NAP provisioned
kubectl get nodes -l karpenter.azure.com/sku-family=NC

# Watch KEDA scaling decisions
kubectl get scaledobject -n inference
kubectl describe scaledobject inference-sb-scaler -n inference

# Check GPU utilization inside a pod
kubectl exec -n inference <pod-name> -- nvidia-smi

# View vLLM metrics (port-forward first)
kubectl port-forward svc/workspace-phi4-mini 5000:5000 -n inference &
curl http://localhost:5000/metrics | grep vllm

# Force scale-to-zero (test cold-start)
kubectl scale deployment workspace-phi4-mini -n inference --replicas=0

# Send a Service Bus message (triggers KEDA scale-up)
az servicebus queue message send \
  --resource-group rg-aks-ai-lab \
  --namespace-name <SB_NAMESPACE> \
  --name inference-requests \
  --body '{"model":"phi-4-mini-instruct","messages":[{"role":"user","content":"Hello"}]}'

# Tear down everything (NAP deprovisions GPU nodes automatically)
cd terraform && terraform destroy
```
Troubleshooting
Workspace stuck in Pending
```bash
kubectl describe workspace workspace-phi4-mini -n inference
kubectl get events -n inference --sort-by=.lastTimestamp

# Common causes:
# 1. GPU quota exhausted → request quota increase
# 2. NAP NodePool limits reached → increase limits in gpu-nodepool.yaml
# 3. Feature flags not registered → re-run 00-prereqs.sh
```
Pod OOMKilled
```bash
kubectl describe pod <pod-name> -n inference

# Reduce max-model-len or max-num-seqs in the KAITO ConfigMap.
# Or upgrade to a larger GPU SKU in the Workspace instanceType.
```
KEDA not scaling
```bash
kubectl describe scaledobject inference-sb-scaler -n inference
# Check: "READY" = True, "ACTIVE" = True/False

# Common causes:
# 1. TriggerAuthentication misconfigured (wrong client ID)
# 2. KEDA identity missing role on Service Bus / Prometheus
# 3. Service Bus queue name mismatch
```
NAP not provisioning GPU nodes
```bash
kubectl get nodepool gpu-inference -o yaml
# Check: limits not exceeded, SKU family allowed in requirements

kubectl logs -n kube-system -l app=karpenter --tail=50
```
Ingress & Traffic Architecture
Ingress for LLM inference is not just a routing problem. It sits at the intersection of network security, API governance, cost control, and GPU utilization. A Kubernetes Ingress object alone addresses none of those.
Network Topology

Azure Front Door
Front Door sits on Microsoft’s anycast network with 200+ points of presence globally. It does three things you can’t skip for LLM:
- WAF — LLM endpoints are expensive to abuse. A single request generating 100K tokens costs real money. Without a WAF, anyone who discovers your endpoint runs up your GPU bill. Front Door’s WAF blocks OWASP attacks, bot traffic, and rate-limits at the edge before requests ever reach your infrastructure.
- DDoS protection — volumetric attacks are absorbed at the edge, not at your origin.
- Long-connection handling — LLM responses take 30–120 seconds for long generations. Front Door manages the client TCP connection and streaming response buffering, so your backend doesn’t have to worry about client timeouts on slow 4G connections.
Azure API Management
Standard request-count rate limiting is useless for LLM. 1,000 requests of 5 tokens each costs nothing. One request of 500K tokens can cost dollars of GPU time. APIM is the only layer that enforces limits on actual token consumption:
```xml
<llm-token-limit
    counter-key="@(context.Subscription.Id)"
    tokens-per-minute="10000"
    token-quota="5000000"
    token-quota-period="Monthly" />
```
Beyond rate limiting:
- Per-consumer quotas — different teams get different token budgets. Finance gets 10M tokens/month, a dev team gets 500K. Without this, one team can exhaust your GPU capacity and affect everyone.
- AAD authentication — verifies the caller’s identity before the request reaches the cluster. No anonymous calls to your GPU.
- Cost chargeback — logs tokens consumed per subscription/consumer to Application Insights. This is how you bill back to internal teams or external customers.
- Response caching — identical prompts never hit the GPU. Huge win for RAG workloads where 50 users ask the same question against the same document.
- Azure OpenAI fallback — if vLLM returns 503 (cold-starting, NAP provisioning a GPU node), APIM automatically falls back to Azure OpenAI API. The client sees no interruption, just a slightly slower response.
App Gateway for Containers
A standard Kubernetes LoadBalancer service operates at Layer 4 (TCP). It has no understanding of HTTP — no routing based on headers, no health checks based on HTTP response codes, no connection draining.
App Gateway for Containers (AGfC) is managed Envoy running outside your cluster. Azure operates the data plane — no CPU or memory overhead on your nodes. What it adds:
- Connection draining — when vLLM is scaling down (KEDA setting replicas to 0), AGfC stops sending new requests to that pod and waits for in-flight requests to finish. Without this, scaling down kills active generations mid-response.
- HTTP/2 and gRPC — vLLM supports both. A Layer 4 LB passes them through blindly; AGfC routes them intelligently.
- Health-based routing — routes only to pods that return 200 on
/health, not just pods that have a running TCP listener. A vLLM pod that’s still loading a 70B model will pass TCP health checks but not HTTP health checks.
Istio
Without Istio, any pod in the cluster that can reach the vLLM Service can call it directly. You have no encryption, no access control, and no observability below the HTTP layer.
- mTLS — all pod-to-pod traffic is encrypted and mutually authenticated using short-lived certificates. Only pods with the right ServiceAccount identity can call vLLM. This is the only way to enforce zero-trust inside the cluster.
- Circuit breakers — LLM pods can get stuck: model loading takes 2–5 minutes, and during that time the pod accepts connections but never responds. Istio’s circuit breaker detects this (response time exceeds threshold, error rate spikes) and stops routing to that pod, giving it time to recover instead of queuing 500 requests against a broken pod.
- Distributed tracing — each request gets a trace ID propagated end-to-end. When a user reports “my request took 90 seconds”, you can see: 2s in AGfC → 3s in Istio routing → 85s in vLLM generation. Without tracing, you’re guessing where the latency is.
- Retry policies — if a request hits a pod still initializing (returns 503), Istio retries automatically against a healthy pod. The client never sees the 503.
Gateway API Inference Extension
Standard Kubernetes load balancing is round-robin — it has no knowledge of what each vLLM pod is actually doing. This is a major performance problem for LLM specifically because of KV cache.
vLLM stores the KV cache (computed attention values for input tokens) in GPU memory. If your system prompt is 2,000 tokens and all 50 users of your chatbot share the same system prompt, any pod that has already processed that system prompt has the KV values cached. If the next request for that user hits a different pod, the pod has to recompute 2,000 tokens from scratch — wasted GPU compute.
The Inference Extension solves this with two mechanisms:
- Prefix-hash routing — hashes the prefix of the prompt (system prompt + conversation history) and routes to the pod most likely to have that prefix in its KV cache. For chatbot workloads where every user shares the same system prompt, cache hit rates go from near-zero to 80%+. This translates directly to 2–4× throughput on the same hardware.
- Load-aware routing — reads vLLM’s Prometheus metrics (queue depth, KV cache utilization, running requests) and routes new requests to the pod with the most available capacity. Standard round-robin ignores this — a pod with 50 queued requests gets the same weight as a pod with 0.
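A toy sketch of the prefix-hash idea; it is not the extension's actual algorithm (which also weighs live load, as described above), but it shows why shared system prompts make sticky routing pay off:

```python
import hashlib

def pick_pod(shared_prefix: str, pods: list[str]) -> str:
    """Requests whose prompts share a prefix land on the same pod,
    so that pod's KV cache for the prefix is reused, not recomputed."""
    digest = hashlib.sha256(shared_prefix.encode()).digest()
    return pods[int.from_bytes(digest[:8], "big") % len(pods)]

pods = ["vllm-0", "vllm-1", "vllm-2"]
system_prompt = "You are a support assistant for Contoso. Policies: ..."

# Hash only the shared part (system prompt + prior turns), not the new message:
assert pick_pod(system_prompt, pods) == pick_pod(system_prompt, pods)
print(pick_pod(system_prompt, pods))  # every conversation with this prefix -> same pod
```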
What This Lab Omits: Firewall
The lab uses a single-spoke VNet design. In production you typically add a hub VNet with a Firewall sitting between the public edge and your workloads:
Internet → AFD (WAF) → Firewall (hub) → APIM (spoke) → AKS (spoke)
The firewall gives you centralized egress logging (every pod outbound connection), threat intelligence filtering, and spoke-to-spoke isolation when multiple teams share the same landing zone. Without it, a compromised pod can reach any internet destination.
For production workloads with compliance requirements or multi-team AKS clusters, it becomes non-negotiable.
GitHub Repository
The lab repository includes Terraform for all infrastructure, Kubernetes manifests for every component, five test scripts validating the full lifecycle from API surface to KEDA scale-up/down to NAP node lifecycle, a sizing guide with the full VRAM formulas, and an ingress guide covering the production network topology:
https://github.com/rtrentinavx/aks-ai-lab
References
Azure Platform
- AKS Node Auto Provisioning (NAP) overview — Microsoft Learn
- Use NVIDIA GPUs on AKS — Microsoft Learn
- Use AMD GPUs on AKS — Microsoft Learn
- AKS Workload Identity overview — Microsoft Learn
- Azure API Management LLM token limit policy — Microsoft Learn
- Azure Front Door Premium WAF integration — Microsoft Learn
- Request GPU quota increase — Microsoft Azure
KAITO
- KAITO project repository — GitHub
- KAITO KEDA autoscaler integration — GitHub
- KAITO do-not-disrupt annotation source — GitHub
- KAITO workspace GC finalizer — GitHub
- KAITO issue #306 — GPU node scale-to-zero request — GitHub
Karpenter / NAP
- Karpenter disruption concepts (do-not-disrupt) — karpenter.sh
- AKS Karpenter node auto-provisioning NodePool configuration — Microsoft Learn
vLLM
- vLLM project repository — GitHub
- vLLM PagedAttention paper — Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al., 2023
KEDA
- KEDA HTTP add-on repository — GitHub
- KEDA scalers documentation — keda.sh
Gateway API / Envoy Gateway
- Kubernetes Gateway API specification — sigs.k8s.io
- Envoy Gateway documentation — gateway.envoyproxy.io
- Gateway API Inference Extension — GitHub
Model Benchmarks and Papers
- Phi-4 Mini technical report — Microsoft, 2025
- Phi-3 Mini model card — Microsoft / Hugging Face
- Mistral 7B paper — Jiang et al., 2023
- Qwen2.5 technical report — Alibaba Cloud, 2024
- Llama 3.1 8B model card — Meta / Hugging Face
- Llama 3.3 70B model card — Meta / Hugging Face
- Gemma 3 technical report — Google DeepMind, 2025
- DeepSeek R1 paper — Incentivizing Reasoning Capability in LLMs via RL — DeepSeek AI, 2025
- Kimi K2 model card — Moonshot AI / Hugging Face
- GPT-4o announcement — Hello GPT-4o — OpenAI, 2024
- Llama 3.3 70B MMLU score — Meta model card — Meta / Hugging Face
Security
- OWASP Top 10 for Large Language Model Applications — OWASP
- Azure Well-Architected Framework — Security pillar — Microsoft Learn
- NVv4 series retirement — Microsoft Learn
- NCv3 series (V100) retirement — Microsoft Learn