Running LLM Inference on AKS
Most teams running LLMs start with a cloud API. At some point — whether driven by cost, compliance, or latency — the question becomes: should we self-host? And if we do, should we run on a VM or on Kubernetes?

This post answers those questions with specifics. It covers when AKS + GPU inference makes sense, how to choose the right model for your use case, and how to size every layer of the stack: GPU node, pod configuration, and replica count.

When Does GPU Inference on AKS Make Sense?

Option A: Cloud API (Azure OpenAI)

No infrastructure. Pay per token.
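The three sizing layers named above (GPU node, pod configuration, replica count) map directly onto a Kubernetes Deployment. A minimal sketch, assuming a vLLM serving image and AKS's default GPU node-pool taint; the name, image tag, model, and resource values are illustrative assumptions, not recommendations:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference            # hypothetical name
spec:
  replicas: 2                    # replica count: scale out for throughput
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # assumed serving image
          args: ["--model", "mistralai/Mistral-7B-Instruct-v0.2"]  # illustrative model
          resources:
            limits:
              nvidia.com/gpu: 1  # pod configuration: one GPU per replica
      tolerations:               # AKS GPU node pools are typically tainted
        - key: sku
          operator: Equal
          value: gpu
          effect: NoSchedule
```

The `nvidia.com/gpu` resource is exposed by the NVIDIA device plugin, and the toleration matches the `sku=gpu:NoSchedule` taint commonly applied to AKS GPU node pools so that only GPU workloads land on those expensive nodes.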
