Self-Hosted LLMOps

When you call Azure OpenAI or the OpenAI API, most of the operational surface disappears: Microsoft or OpenAI manages the GPU, the model weights, the inference runtime, and the content filters. Your ops surface is prompts, evals, and cost control.

Self-hosted LLMOps is what remains when you take all of that back. You own the GPU lifecycle, the model serving process, the scaling logic, the guardrails, the observability pipeline, and the feedback loop that improves quality over time. The tradeoffs that make self-hosting worth it — data sovereignty, cost at volume, no vendor lock-in, full control over serving parameters — come with a proportional operational commitment.

LLMOps borrows MLOps vocabulary, but the failure modes are different. An ML model fails silently when its distribution drifts. An LLM fails loudly — with confident nonsense, injected instructions, hallucinated citations, or a 30-second response time that breaks your frontend timeout. The operational discipline has to match the failure mode.

This guide covers the full lifecycle: observability at each layer of the stack, evaluation design, prompt engineering, RAG, fine-tuning, cost optimization, CI/CD, and the feedback loops that close the improvement cycle.

The implementations use AKS — KAITO, NAP, KEDA, APIM, Workload Identity as discussed on https://rtrentinsworld.com/2026/03/27/running-llm-inference-on-aks/ — but the operational patterns apply to any self-hosted inference deployment.

Observability

Inference observability operates at two distinct layers that require different tools and answer different questions.

System layer — what is the GPU doing? Is the KV cache full? Are requests queueing? This is answered by vLLM’s Prometheus metrics, surfaced in the lab’s Azure Managed Grafana.

Application layer — which prompt produced a bad answer? Which user session had high latency? What was the token distribution across requests? This is answered by a tracing tool like Langfuse that captures the semantic content of each call.

You need both. System metrics tell you the machine is sick; application traces tell you which patient is suffering.

Latency has three components — measure all three

Most teams measure only end-to-end response time and miss two diagnostically distinct signals:

| Metric | Definition | What causes it | SLO target |
|---|---|---|---|
| TTFT — Time to First Token | Wall clock from request send to first token received | Prefill phase: processing the entire input prompt | < 500ms for chat, < 3s for long-context RAG |
| TPOT — Time Per Output Token | Average milliseconds per generated token after the first | Decode phase: GPU throughput | < 30ms/token for real-time chat |
| E2E latency | Total request time | TTFT + (completion_tokens × TPOT) + network | Function of both above + payload size |

Why this matters: TTFT and TPOT have different root causes and different fixes. High TTFT means your prefill is too long (large context, no prefix caching, or the scheduler is overwhelmed). High TPOT means your GPU is undersized or oversubscribed. Measuring only E2E hides which knob to turn.
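The split can be computed directly from a streaming response. A minimal helper (illustrative, not tied to any client library) that derives all three components from the arrival timestamps of streamed tokens — in practice you would record a monotonic timestamp as each chunk arrives from a `stream=True` completion call:

```python
def latency_components(request_start: float, token_times: list[float]) -> dict:
    """Derive TTFT, TPOT, and E2E latency from streamed token arrival times."""
    if not token_times:
        raise ValueError("no tokens received")
    # Prefill cost: time until the first token appears
    ttft = token_times[0] - request_start
    if len(token_times) > 1:
        # Decode cost: average gap between tokens after the first
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": token_times[-1] - request_start}
```

Logging these two numbers per request, rather than only E2E, is what lets you tell a prefill problem from a decode problem.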

vLLM metrics — the essential set

vLLM exposes a Prometheus endpoint at /metrics. With the lab’s Azure Managed Prometheus scraping vLLM pods, these queries go directly into Grafana.

Request queue health:

# Requests waiting for a GPU slot — the primary scaling signal
vllm:num_requests_waiting{namespace="inference"}
# Running sequences — are we at max-num-seqs capacity?
vllm:num_requests_running{namespace="inference"}

When num_requests_waiting is consistently > 0, you have more demand than GPU capacity. KEDA should be watching this.

KV cache utilization — the throughput governor:

# KV cache fill rate — aim for 70-85% at peak, not 95%+
vllm:gpu_cache_usage_perc{namespace="inference"}

At 95%+ KV cache utilization, vLLM starts evicting blocks from queued sequences. This causes TTFT spikes as prefills get re-processed. Size max-num-seqs so you hit 80-85% at expected peak, not at maximum concurrency.
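To pick that number, estimate the KV cache cost of one sequence from the model architecture. A back-of-the-envelope sketch — the layer/head/dimension values in the test are illustrative, not any specific model's config:

```python
def kv_cache_per_seq_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                        seq_len: int, bytes_per_param: int = 2) -> float:
    """KV cache for one full-length sequence: K and V tensors per layer per position."""
    bytes_total = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_param
    return bytes_total / 1024**3


def max_seqs_for_target(free_vram_gb: float, per_seq_gb: float,
                        target_util: float = 0.85) -> int:
    """Size max-num-seqs so expected peak lands at ~85% cache utilization."""
    return int(free_vram_gb * target_util / per_seq_gb)
```

This is a worst case — vLLM's paged allocator only reserves blocks as sequences actually grow — but it gives a defensible starting value to tune from.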

Token throughput — your cost efficiency signal:

# Tokens generated per second (decode throughput)
rate(vllm:generation_tokens_total{namespace="inference"}[5m])
# Tokens processed in prefill per second
rate(vllm:prompt_tokens_total{namespace="inference"}[5m])

A healthy vLLM pod on a T4 with Phi-4 Mini should sustain 2,000–4,000 generation tokens/second. If you’re seeing 500 tokens/second at moderate load, the GPU is undersized or there’s a scheduling pathology.
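Sustained throughput converts directly into a cost-per-token figure, which is the number to compare against managed API pricing. A sketch — the hourly rate you plug in is your own VM price, not a quoted Azure rate:

```python
def cost_per_million_tokens(vm_price_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per 1M generated tokens at a sustained decode throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return vm_price_per_hour / tokens_per_hour * 1_000_000
```

At a hypothetical $0.53/hour spot T4 sustaining 3,000 tokens/second, this works out to roughly $0.05 per million generated tokens — the comparison point for any hosted-API alternative.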

Latency percentiles — for SLO compliance:

# p95 time to first token
histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket{namespace="inference"}[10m]))
# p95 time per output token
histogram_quantile(0.95, rate(vllm:time_per_output_token_seconds_bucket{namespace="inference"}[10m]))
# p95 end-to-end latency
histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket{namespace="inference"}[10m]))

GPU hardware metrics — from the DCGM exporter (manifests/monitoring/dcgm-exporter.yaml):

# GPU compute utilization — should be 70-90% under load
DCGM_FI_DEV_GPU_UTIL{namespace="inference"}
# GPU memory used vs total
DCGM_FI_DEV_FB_USED{namespace="inference"} / DCGM_FI_DEV_FB_TOTAL{namespace="inference"}
# GPU temperature — alert above 85°C
DCGM_FI_DEV_GPU_TEMP{namespace="inference"}

Low GPU utilization (< 40%) at peak load means the GPU is waiting on something — likely the KV scheduler, CPU tokenization, or a max-num-seqs ceiling that’s too low. High utilization (> 95%) with growing request queues means you need more replicas.

APIM metrics — cost attribution at the consumer level

APIM’s Application Insights integration provides the token attribution data that vLLM doesn’t have: which consumer is sending how many tokens.

Configure token emission in the APIM policy outbound section:

<outbound>
    <base />
    <llm-emit-token-metric namespace="InferenceTokens">
        <dimension name="consumer-id" value="@(context.Subscription.Id)" />
        <dimension name="model" value="@(context.Request.Body.As<JObject>(preserveContent: true)["model"]?.ToString() ?? "unknown")" />
        <dimension name="environment" value="@(context.Deployment.ServiceName.Contains("prod") ? "prod" : "dev")" />
    </llm-emit-token-metric>
</outbound>

This feeds a Log Analytics query for monthly chargeback by team:

customMetrics
| where name == "InferenceTokens"
| summarize total_tokens = sum(value) by tostring(customDimensions["consumer-id"]), bin(timestamp, 1d)
| order by total_tokens desc

Application-level tracing with Langfuse

Langfuse captures the semantic content of each LLM call: which prompt, which response, latency, token counts, and any scores you attach. This is where debugging happens when a user reports a bad answer.

Deploy Langfuse on AKS:

helm repo add langfuse https://langfuse.github.io/langfuse-k8s
helm upgrade --install langfuse langfuse/langfuse \
--namespace langfuse --create-namespace \
--set langfuse.nextauth.secret="$(openssl rand -hex 32)" \
--set langfuse.salt="$(openssl rand -hex 16)" \
--set postgresql.enabled=true \
--set postgresql.auth.database=langfuse \
--set langfuse.resources.requests.memory="512Mi" \
--set langfuse.resources.requests.cpu="250m"

Route through the cluster’s Envoy Gateway for internal access. For external access, put Langfuse behind APIM with AAD auth — it contains prompt content and should not be publicly accessible.

SDK instrumentation (Python):

import os

from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

langfuse = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ["LANGFUSE_HOST"],  # internal cluster URL
)

@observe()
def generate_response(user_query: str, session_id: str) -> str:
    langfuse_context.update_current_trace(
        user_id=session_id,
        tags=["production", "customer-support"],
    )
    # Retrieve context (if RAG)
    chunks = retrieve(user_query)
    langfuse_context.update_current_observation(
        input={"query": user_query, "chunks_retrieved": len(chunks)},
    )
    # Call vLLM (OpenAI-compatible endpoint)
    response = openai_client.chat.completions.create(
        model="phi-4-mini-instruct",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{chunks}\n\nQuestion: {user_query}"},
        ],
        max_tokens=512,
    )
    output = response.choices[0].message.content
    # Attach quality score if you have one
    langfuse_context.score_current_trace(
        name="groundedness",
        value=check_groundedness(output, chunks),
    )
    return output

Trace correlation across APIM and vLLM

Propagate a trace ID from APIM through to the application so a single request is traceable across all layers:

<!-- APIM inbound: generate and forward trace ID -->
<set-header name="X-Trace-Id" exists-action="skip">
    <value>@(Guid.NewGuid().ToString())</value>
</set-header>
<set-variable name="traceId" value="@(context.Request.Headers.GetValueOrDefault("X-Trace-Id"))" />

Your application reads X-Trace-Id from the incoming request and passes it to Langfuse as the trace_id. This lets you correlate a Langfuse trace with APIM logs, vLLM logs, and Kubernetes pod logs using a single ID.

Evaluation

Evals are the tests for your LLM system. Without them, you cannot safely change a prompt, upgrade a model, or modify a RAG pipeline — you’re deploying blind.

The hardest part is not the tooling. It’s defining what “correct” means for your task, assembling representative test cases, and deciding which failure modes matter most. The tooling is secondary.

What you’re actually testing

The eval surface for an LLM system has three layers, and they require different techniques:

| Layer | What changes | Eval technique |
|---|---|---|
| Prompt | Wording, instructions, examples | Golden dataset comparison |
| Model | Weights, quantization, version | Benchmark regression |
| RAG pipeline | Chunking, retrieval, re-ranking | Retrieval + faithfulness metrics |

Each layer changes at a different frequency. Prompts change most often (weekly in active development). Models change occasionally (when a new version drops). The RAG pipeline changes when the document corpus changes or retrieval quality issues surface.

Building your first golden dataset

The bootstrap problem: you need test cases before you have production data, but the best test cases come from production failures. How to start:

Step 1 — Write 20 cases by hand. Cover the happy path (typical query, good answer), three known edge cases, and two adversarial inputs. Write the expected answer in detail — not “a correct answer” but the specific things it must contain.

Step 2 — Generate synthetic variants. Use GPT-4o or a strong model to paraphrase your 20 cases into 60–80 variants. Prompt: “Generate 4 rephrased versions of this user question that ask the same thing differently.” This gives you coverage without manual effort.

Step 3 — Collect production failures once deployed. Every time a user flags a bad answer (thumbs down, escalation, correction), add it to the dataset. Production failures are worth 10× synthetic cases because they represent real failure modes you didn’t anticipate.

Step 4 — Balance the dataset. Check that your cases cover the full distribution of your real traffic — length, topic, complexity. A dataset of 100 short simple questions will pass with flying colors while the 20% of long complex queries fail in production.

Minimum viable dataset size:

  • 50–100 cases: can detect changes ≥ 10% in quality
  • 200–500 cases: can detect changes ≥ 5%, meaningful regression testing
  • 1,000+ cases: statistical confidence for fine-grained comparison

For a 95% confidence interval with 5% margin of error, you need ~385 test cases. For 2% margin of error, ~2,400. Budget accordingly.
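Those numbers fall out of the standard sample-size formula for a proportion (n = z²·p(1−p)/e², with p = 0.5 as the worst case), which is small enough to keep next to your eval tooling:

```python
import math

def eval_sample_size(margin: float, confidence_z: float = 1.96, p: float = 0.5) -> int:
    """Test cases needed to estimate a pass rate within ±margin.

    Standard proportion sample-size formula; z=1.96 corresponds to 95% confidence
    and p=0.5 is the worst-case (maximum variance) assumption.
    """
    return math.ceil(confidence_z**2 * p * (1 - p) / margin**2)
```

Running it with a 5% margin reproduces the ~385 figure above; a 2% margin gives 2,401.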

Assertion types

Deterministic assertions — for outputs with a known right answer:

# promptfooconfig.yaml
tests:
  - vars:
      query: "What port does vLLM listen on by default?"
    assert:
      - type: contains
        value: "8000"
      - type: not-contains
        value: "8080" # common wrong answer
      - type: javascript
        value: |
          output.length < 200 // reject verbose responses

Use deterministic assertions for: factual questions, structured output format, required keywords, output length constraints.

LLM-as-judge — for quality dimensions that can’t be checked with string matching:

tests:
  - vars:
      context: "The document says X, Y, and Z."
      query: "Summarize the key points."
    assert:
      - type: llm-rubric
        value: |
          The response should:
          1. Mention X, Y, and Z from the provided context
          2. Not introduce information not present in the context
          3. Be written in 2-4 sentences
          4. Not start with "Certainly!" or similar filler

Custom validator (Python) for task-specific checks:

# In your test suite
def check_json_output(output: str, context: dict) -> dict:
"""Validate structured output is valid JSON matching expected schema."""
import json
from jsonschema import validate
expected_schema = {
"type": "object",
"required": ["category", "confidence", "reason"],
"properties": {
"category": {"type": "string", "enum": ["billing", "technical", "account"]},
"confidence": {"type": "number", "minimum": 0, "maximum": 1},
"reason": {"type": "string"},
}
}
try:
parsed = json.loads(output)
validate(parsed, expected_schema)
return {"pass": True, "score": parsed["confidence"]}
except Exception as e:
return {"pass": False, "reason": str(e)}

LLM-as-judge — implementation details

Using an LLM to evaluate another LLM’s output is powerful but has documented biases:

  • Position bias: the judge prefers the first answer when comparing two
  • Verbosity bias: the judge prefers longer responses even when they’re less accurate
  • Self-enhancement bias: GPT-4o ranks GPT-4o outputs higher; use a different family as judge

Calibrated judge prompt pattern:

JUDGE_PROMPT = """You are an expert evaluator for a {task_type} system.
Evaluate the following response on a scale of 1-5 for {dimension}:
1 = Completely wrong or harmful
2 = Mostly wrong with minor correct elements
3 = Partially correct but missing key information
4 = Mostly correct with minor issues
5 = Completely correct and well-formed
Task: {task_description}
User question: {user_query}
{context_block}
Response to evaluate:
{response}
Provide your evaluation in this exact JSON format:
{{
"score": <1-5>,
"reasoning": "<one sentence explaining the score>",
"key_issues": ["<issue 1>", "<issue 2>"]
}}
Do not consider response length in your score. Evaluate only accuracy and completeness."""

Calibration: Before using LLM-as-judge at scale, label 50–100 examples yourself and measure judge-human agreement (Cohen’s kappa). Target kappa > 0.6 (substantial agreement). If the judge disagrees with your labels on > 30% of cases, revise the prompt.
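Cohen's kappa is simple enough to compute without a stats library. A minimal sketch for two parallel label lists (human and judge):

```python
from collections import Counter

def cohens_kappa(human: list, judge: list) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    assert human and len(human) == len(judge)
    n = len(human)
    # Observed agreement: fraction of items where both raters match
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement: chance of matching given each rater's label frequencies
    h_freq, j_freq = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum((h_freq[l] / n) * (j_freq[l] / n) for l in labels)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters use a single label throughout
    return (observed - expected) / (1 - expected)
```

Kappa of 1.0 is perfect agreement; 0.0 means the judge agrees with you no more than chance would predict — a judge at 0.0 is adding noise, not signal.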

RAG-specific evaluation with RAGAS

RAGAS evaluates the full RAG pipeline — not just the final answer but the retrieval quality.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Build evaluation dataset
data = {
    "question": ["What is the maximum context length for Phi-4 Mini?"],
    "answer": ["Phi-4 Mini supports up to 128K context tokens."],  # model output
    "contexts": [["Phi-4 Mini has a 128K context window and 3.8B params"]],  # retrieved chunks
    "ground_truth": ["Phi-4 Mini has a 128K token context window."],  # expected answer
}
dataset = Dataset.from_dict(data)
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=your_azure_openai_client,  # judge model — use a stronger model than the one you're evaluating
)

| Metric | What a low score means | Fix |
|---|---|---|
| Faithfulness | Answer contains information not in retrieved chunks | Reduce hallucination: lower temperature, add “only use the provided context” instruction |
| Answer relevancy | Answer doesn’t address the question | Improve generation prompt instructions |
| Context precision | Retrieved chunks contain lots of irrelevant content | Improve retrieval: better embedding model, tighter query, stricter similarity threshold |
| Context recall | Retrieval missed chunks needed to answer | Improve retrieval: more chunks per query, smaller chunk size, re-ranking |

Run RAGAS in CI on every change to your chunking strategy, embedding model, or retrieval parameters. A 5% drop in context recall on your golden dataset is a merge blocker.

Tracking regressions over time

Store eval results with timestamps and compare model-by-model in Langfuse or a simple PostgreSQL table:

CREATE TABLE eval_runs (
    run_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    run_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    model VARCHAR(100),
    prompt_version VARCHAR(20),
    dataset_name VARCHAR(100),
    pass_rate FLOAT,
    avg_faithfulness FLOAT,
    avg_relevancy FLOAT,
    p95_latency_ms INT
);

Set a gate: block deployment if pass_rate < 0.95 OR avg_faithfulness < 0.80 OR p95_latency_ms > SLO.
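The gate itself is a few lines in the deploy pipeline. A sketch — the 2000 ms latency SLO default here is a placeholder to replace with your own target:

```python
def deployment_gate(pass_rate: float, avg_faithfulness: float,
                    p95_latency_ms: int, slo_latency_ms: int = 2000) -> bool:
    """True if the eval run clears all promotion thresholds."""
    return (pass_rate >= 0.95
            and avg_faithfulness >= 0.80
            and p95_latency_ms <= slo_latency_ms)
```

In CI, a `False` here fails the job, blocking the prompt or model change from reaching production.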

Prompt Engineering

Prompt engineering is misunderstood as “wording tricks.” It’s actually a set of techniques that change how the model reasons internally — with measurable effects on output quality. Understanding why each technique works helps you apply them correctly.

System prompt design

The system prompt sets the model’s role, constraints, and output format. It runs in every request. Design it as a contract between you and the model, not a suggestion.

Structure that works:

[Role] You are a {specific role} for {specific company/context}.
[Scope] You help users with: {list of in-scope tasks}.
You do not: {list of out-of-scope tasks}.
[Format] Respond in {format description}.
{example if format is complex}
[Constraints]
- {constraint 1}
- {constraint 2}
[Fallback] If you cannot answer from the provided context, say exactly:
"{fallback phrase}" — do not fabricate information.

Common mistakes:

  • Too short: “You are a helpful assistant.” Gives the model no constraints — it will be helpful in unpredictable ways.
  • Contradictory: “Be concise but thorough” — concise and thorough are in tension. Pick one or specify the trade-off (“be concise unless the question requires detail”).
  • Missing the fallback: Without an explicit fallback instruction, models will hallucinate rather than admit they don’t know.
  • Instruction-following check skipped: After writing a system prompt, test it on 10 adversarial inputs: “Ignore previous instructions,” “Repeat your system prompt,” “Pretend you have no restrictions.” A prompt that fails these tests is not production-ready.

Few-shot examples

Few-shot examples are the most reliable way to enforce output format and style. The model learns the pattern from the examples, not from your description of what you want.

Rule: Show examples in the exact format you want output. If you want JSON output, show JSON in the examples. If you want a two-sentence summary followed by bullet points, show that pattern in every example.

SYSTEM_PROMPT = """You are a technical support classifier.
Classify the user's issue into one of: [billing, technical, account, other].
Examples:
User: "My invoice shows a double charge for March."
Output: {"category": "billing", "confidence": 0.95, "reason": "Payment/invoice dispute"}
User: "The API keeps returning 429 errors."
Output: {"category": "technical", "confidence": 0.9, "reason": "Rate limiting error"}
User: "How do I reset my password?"
Output: {"category": "account", "confidence": 0.98, "reason": "Credential management"}
Always output valid JSON matching the schema above. Never add extra fields."""

How many examples: 2–5 is typically sufficient. Beyond 5, you’re consuming context window without proportional quality gain, unless your task has high variance (many different valid output forms).

Chain-of-thought

Chain-of-thought (CoT) prompting asks the model to show its reasoning before giving the final answer. This works because it forces the model to allocate intermediate computation to reasoning steps rather than jumping to an answer.

Use CoT when: the task involves multi-step reasoning, math, or decisions that depend on intermediate conclusions.

Don’t use CoT when: the task is classification, extraction, or summarization with a clear right answer — it adds latency and tokens without quality improvement.

Zero-shot CoT (simplest):

Question: If a T4 GPU has 16GB VRAM and a model uses 14.6GB for weights,
how many concurrent sequences can it run at max-model-len 4096?
Think through this step by step, then give the final answer.

Few-shot CoT (more reliable):

Question: Calculate GPU tier needed for Llama 3.3 70B at int8.
Reasoning:
- Parameters: 70.6B
- int8 bytes per param: 1.0
- Weights memory: 70.6B × 1.0 = 70.6 GB
- Apply 1.3× headroom: 70.6 × 1.3 = 91.8 GB needed
- T4 (16GB): too small
- A100 80GB: 80 × 0.90 = 72 GB usable < 91.8 GB: too small
- 2× A100 80GB: 160 × 0.90 = 144 GB usable > 91.8 GB: sufficient
Answer: 2× A100 80GB (NC48ads_A100_v4)
Question: Calculate GPU tier needed for Phi-4 Mini at fp16.
Reasoning:
<model completes the pattern>

Structured output

For tasks that produce JSON, Markdown tables, or other structured formats, reliability matters. Three techniques in order of reliability:

Option 1 — JSON mode (vLLM / OpenAI API):

response = client.chat.completions.create(
    model="phi-4-mini-instruct",
    messages=[...],
    response_format={"type": "json_object"},  # forces JSON output
)

JSON mode guarantees syntactically valid JSON but not schema compliance.

Option 2 — Grammar-constrained decoding (vLLM, most reliable):

import json

from vllm import LLM, SamplingParams

schema = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "technical", "account"]},
        "confidence": {"type": "number"},
    },
    "required": ["category", "confidence"],
}
sampling_params = SamplingParams(
    guided_decoding={"json": json.dumps(schema)}  # tokens that violate schema are masked
)

Grammar-constrained decoding modifies the token probability distribution at each step so only tokens that keep the output valid are sampled. 100% schema compliance, no retry logic needed.
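The masking idea is easy to illustrate. A toy sketch (not vLLM's implementation) over a pretend vocabulary: tokens outside the allowed set get a −inf logit, so even the highest-scoring invalid token can never be selected:

```python
import math

def mask_and_pick(logits: dict[str, float], allowed: set[str]) -> str:
    """Greedy decode over a toy vocabulary with invalid tokens masked to -inf."""
    masked = {tok: (score if tok in allowed else -math.inf)
              for tok, score in logits.items()}
    return max(masked, key=masked.get)
```

If the raw logits favor a token the schema forbids, the mask removes it from contention entirely — which is why guided decoding needs no retry loop.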

Option 3 — Guardrails AI (post-processing validation):

from guardrails import Guard
from guardrails.hub import ValidJSON, ValidChoices

guard = Guard().use_many(
    ValidJSON(),
    ValidChoices(choices=["billing", "technical", "account"], on_fail="reask"),
)
response = guard(
    openai_client.chat.completions.create,
    prompt="Classify this support ticket: ...",
    model="phi-4-mini-instruct",
    max_tokens=200,
)

Guardrails AI retries up to N times with an error message injected into the context, asking the model to fix its output.

Context window management

Long conversations degrade quality. As the context grows, models give less attention to the system prompt and early instructions. At 60–80% of the context window, instruction following typically degrades.

Three mitigation strategies:

Progressive summarization:

def manage_context(messages: list, model_max_tokens: int, reserve_tokens: int = 1000) -> list:
    """Summarize old messages when context approaches limit."""
    current_tokens = count_tokens(messages)
    limit = model_max_tokens - reserve_tokens  # reserve for completion
    if current_tokens < limit * 0.7:
        return messages  # no action needed
    # Keep system prompt + last 4 turns + summarize the rest
    system = [m for m in messages if m["role"] == "system"]
    recent = messages[-4:]
    to_summarize = [m for m in messages if m not in system and m not in recent]
    if not to_summarize:
        return messages
    summary = summarize_conversation(to_summarize)  # call LLM to summarize
    summary_msg = {"role": "assistant", "content": f"[Previous conversation summary: {summary}]"}
    return system + [summary_msg] + recent

Selective context injection (RAG conversations): Instead of accumulating the full conversation, re-retrieve context on each turn. The user’s latest message contains most of the retrieval signal needed — prior turns add diminishing value.

Fixed-size sliding window: For multi-turn chat, keep only the last N turns. Simple and effective for most chatbot use cases. N=10 turns covers 95%+ of real conversations while keeping context manageable.
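The sliding window is a few lines. This sketch assumes the usual convention that one turn is a user message plus an assistant reply, and always preserves the system prompt:

```python
def sliding_window(messages: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep the system prompt plus the last N user/assistant turns."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    # One turn = one user message + one assistant reply, so keep 2N messages
    return system + dialogue[-2 * max_turns:]
```

Apply it to the message list before every completion call; unlike progressive summarization it costs no extra LLM calls.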

RAG Patterns

RAG adds a retrieval step that makes the model’s answer dependent on your documents, not its training data. This is correct for domain-specific, frequently-changing, or private information. The tradeoff: quality is now bounded by both retrieval quality and generation quality.

Chunking — the upstream bottleneck

Bad chunking propagates through the entire pipeline. A missed fact at the chunking step cannot be recovered by better retrieval or a better model.

Fixed-size chunking with overlap:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,  # tokens, not characters
    chunk_overlap=50,  # ~10% overlap to avoid boundary splits
    length_function=lambda text: len(tokenizer.encode(text)),  # use target model's tokenizer
)

Why 512 tokens: at this size, each chunk contains roughly one coherent topic. Larger chunks increase recall but decrease precision (more noise per retrieved chunk). Smaller chunks increase precision but miss context that spans multiple sentences.

Sentence-aware chunking (better for prose):

from langchain.text_splitter import SpacyTextSplitter
# Respects sentence boundaries — never splits mid-sentence
splitter = SpacyTextSplitter(chunk_size=512)

Code-aware chunking (critical for technical docs):

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Splits at function/class boundaries, not arbitrary character positions
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100,
)

For codebases, splitting at the function level (using the AST) outperforms fixed-size splitting by 15–25% on code retrieval tasks. Each function is a semantic unit — a fixed-size splitter cuts functions in half.
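A minimal version of function-level splitting with Python's standard ast module — this sketch handles top-level functions and classes only; a production splitter would also capture module-level code between them:

```python
import ast

def split_by_function(source: str) -> list[str]:
    """One chunk per top-level function or class, using the parsed AST."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment recovers the exact source text for the node
            chunks.append(ast.get_source_segment(source, node))
    return chunks
```

Each returned chunk is a complete, syntactically whole unit — exactly what a fixed-size splitter cannot guarantee.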

Metadata enrichment at index time: Attach metadata to every chunk before storing it. This enables filtered retrieval later:

chunks_with_metadata = [
    {
        "content": chunk.page_content,
        "metadata": {
            "source": document_path,
            "section": extract_section_heading(chunk),
            "doc_type": "technical_guide",
            "last_updated": document_date,
            "language": "en",
        },
    }
    for chunk in chunks
]

Retrieval strategies

Sparse + dense hybrid retrieval: Neither BM25 (keyword) nor vector (semantic) retrieval dominates across all query types. Sparse retrieval is better for exact term matching (product codes, error messages, proper nouns). Dense retrieval is better for semantic similarity (“how do I fix latency” ↔ “TTFT optimization”).

Combining them consistently outperforms either alone.

from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

def hybrid_retrieve(query: str, top_k: int = 20) -> list:
    """Combine BM25 and vector search, return top-k by reciprocal rank fusion."""
    query_embedding = embed(query)
    results = search_client.search(
        search_text=query,  # BM25 path
        vector_queries=[
            VectorizedQuery(
                vector=query_embedding,
                k_nearest_neighbors=top_k,
                fields="content_vector",  # dense path
            )
        ],
        query_type="semantic",  # rerank with semantic model
        semantic_configuration_name="inference-config",
        top=top_k,
        select=["content", "source", "section", "metadata"],
    )
    return list(results)

Azure AI Search handles the fusion and semantic re-ranking natively when query_type="semantic".
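If you are not on a search backend that fuses natively, the fusion step is small enough to own. A sketch of reciprocal rank fusion, using the commonly cited k=60 constant — each document scores the sum of 1/(k + rank) over every list it appears in:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs: score = sum of 1/(k + rank) across lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both BM25 and the dense retriever beats one that tops only a single list — which is the behavior that makes hybrid retrieval robust.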

Filtered retrieval — scoped to relevant documents:

results = search_client.search(
    search_text=query,
    filter="doc_type eq 'technical_guide' and last_updated ge 2026-01-01",
    top=10,
)

Filtering before retrieval is more efficient than filtering top-N results after retrieval. Set filters based on available metadata — document type, recency, access level, user context.

HyDE (Hypothetical Document Embedding): For queries that are short and abstract (“how does KEDA scaling work?”), the query embedding is often too sparse to retrieve the right chunks. HyDE generates a hypothetical answer first, embeds the answer rather than the query, and retrieves documents similar to the hypothetical answer.

def hyde_retrieve(query: str, llm_client, top_k: int = 5) -> list:
    """Retrieve using a hypothetical answer embedding instead of the raw query."""
    # Generate a hypothetical ideal answer (doesn't need to be accurate)
    hyde_response = llm_client.chat.completions.create(
        model="phi-4-mini-instruct",
        messages=[{
            "role": "user",
            "content": f"Write a 3-sentence technical explanation that would answer: {query}",
        }],
        max_tokens=150,
    )
    hypothetical_answer = hyde_response.choices[0].message.content
    # Embed the hypothetical answer and retrieve
    embedding = embed(hypothetical_answer)
    return vector_search(embedding, top_k=top_k)

HyDE improves recall on abstract or paraphrased queries by 10–20% at the cost of one additional LLM call.

Re-ranking

Retrieval returns candidates. Re-ranking selects the best ones. A cross-encoder re-ranker reads the query and each document together and produces a relevance score — it’s slower than embedding similarity but significantly more accurate.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Score query-document pairs and return top_n."""
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [doc for _, doc in ranked[:top_n]]

Retrieval strategy by use case:

| Use case | Strategy |
|---|---|
| Simple Q&A over structured docs | Dense-only, top-5, no re-rank (fast) |
| Technical support over mixed content | Hybrid (BM25 + dense), re-rank top-20 → top-5 |
| Legal/compliance document search | Hybrid + metadata filter + re-rank + citation |
| Multi-hop questions (answer requires >1 doc) | Iterative retrieval or graph-based RAG |

Multi-hop retrieval for complex questions

Some questions cannot be answered from a single chunk — the answer requires combining information across multiple documents. Standard single-shot retrieval fails here.

Iterative retrieval:

def multi_hop_retrieve(question: str, max_hops: int = 3) -> list:
    all_contexts = []
    current_query = question
    for hop in range(max_hops):
        chunks = retrieve(current_query, top_k=3)
        all_contexts.extend(chunks)
        # Ask the model: do we have enough information? If not, what do we still need?
        reflection = llm_client.chat.completions.create(
            model="phi-4-mini-instruct",
            messages=[{
                "role": "user",
                "content": f"""Question: {question}
Retrieved so far:
{format_chunks(all_contexts)}
Can you fully answer the question with the above context?
If yes, respond: "COMPLETE"
If no, respond with the specific follow-up question needed to find the missing information.""",
            }],
            max_tokens=100,
        ).choices[0].message.content
        if reflection.strip() == "COMPLETE" or hop == max_hops - 1:
            break
        current_query = reflection  # next hop uses the model's follow-up question
    return all_contexts

Fine-Tuning on AKS

Fine-tuning is often reached for too early. Before investing in it, try prompt engineering and RAG — they’re faster to iterate. Fine-tune when:

  • Latency/cost reduction: you need GPT-4-level task quality from a T4-tier model. A 7B model fine-tuned on your specific task often outperforms a 70B general model on that task.
  • Consistent structured output: the model needs to reliably produce a specific JSON schema or output format that prompt engineering can’t reliably enforce.
  • Style and voice: the model needs to write in a specific brand voice or follow house style that’s difficult to describe in a prompt.
  • Knowledge consolidation: you have proprietary data that changes infrequently and can be baked into the weights. Note: for frequently-changing data, RAG is almost always better.

Don’t fine-tune when:

  • Your task success rate is below 70% on your eval set — the model doesn’t understand the task at all. More data won’t fix a fundamentally wrong model; fix your prompt first.
  • You have fewer than 500 high-quality labeled examples. Fine-tuning on low-quality or insufficient data produces a model that confidently does the wrong thing.
  • Your use case is adding new knowledge (facts, documents, product catalog). Models don’t reliably memorize facts through fine-tuning; they learn behavioral patterns. Use RAG.

LoRA and QLoRA — what you’re actually training

Full fine-tuning updates all weights — computationally prohibitive for 7B+ parameter models on single GPUs. LoRA (Low-Rank Adaptation) is a parameter-efficient technique that freezes the original weights and adds small trainable adapter matrices.

The math, briefly: instead of updating a weight matrix W (size d×d), LoRA adds two matrices A (d×r) and B (r×d) where r is the “rank” — typically 8–64. Total trainable parameters: 2 × d × r instead of d². At rank 16 for a 7B model, you train ~0.1% of the parameters while preserving most quality.
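
As a sanity check, the trainable-parameter count can be computed directly. A minimal sketch, using illustrative dimensions (32 layers, hidden size 4096, adapters on q_proj and v_proj only — these numbers are assumptions for the example, not taken from the KAITO config below):

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA adds A (d_in x r) and B (r x d_out) per adapted weight matrix
    return rank * (d_in + d_out)

# Illustrative 7B-class dimensions (assumption, for the arithmetic only)
layers, hidden, rank = 32, 4096, 16
per_layer = 2 * lora_trainable_params(hidden, hidden, rank)  # q_proj + v_proj
total = layers * per_layer
print(f"{total:,} trainable parameters")        # 8,388,608
print(f"{total / 7e9:.3%} of a 7B base model")  # ~0.120%
```

which lands on the "~0.1% of the parameters" figure quoted above.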

QLoRA runs LoRA on a 4-bit quantized base model, cutting the VRAM required for training by more than half. A Mistral 7B fine-tune that requires ~30GB with standard LoRA runs in ~12GB with QLoRA — it fits comfortably on a single A100 80GB node.

Dataset preparation

The quality of your training data determines the ceiling of your fine-tuned model.

Format for instruction fine-tuning:

{"messages": [
  {"role": "system", "content": "You are a customer support agent for AKS AI Lab."},
  {"role": "user", "content": "My GPU node didn't provision. What do I check?"},
  {"role": "assistant", "content": "Check these in order:\n1. Run `kubectl get nodeclaim -n karpenter` — look for a Failed NodeClaim\n2. Check AKS quota: `az vm list-usage -l eastus | grep NC`\n3. Check the Karpenter controller logs: `kubectl logs -n karpenter deployment/karpenter | grep ERROR`"}
]}

Minimum viable dataset sizes:

Goal | Min examples | Notes
Format/style adaptation | 500 | Model already knows the domain; you’re shaping output style
Domain-specific knowledge | 2,000–5,000 | Model needs to learn new facts + format
Task specialization | 1,000–3,000 | High-quality examples matter more than quantity
Safety/refusal training | 500+ (incl. negative examples) | Include both positive and “this should be refused” pairs

Quality checklist before training:

  •  Every example is correct — wrong examples actively degrade the model
  •  No duplicate or near-duplicate examples (deduplicate by semantic similarity)
  •  Coverage is balanced — check topic/length/complexity distribution
  •  No PII in training data
  •  Adversarial inputs have appropriate refusal responses
  •  Output format is consistent across all examples
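
The near-duplicate check from the list above can be sketched as a greedy pass over embeddings. This is a sketch: embed stands in for whatever embedding function the pipeline already uses, and the 0.92 threshold is an assumption to tune against your data:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dedup_examples(examples: list[dict], embed, threshold: float = 0.92) -> list[dict]:
    """Drop near-duplicates: keep an example only if no already-kept example is too similar."""
    kept, kept_embs = [], []
    for ex in examples:
        emb = embed(ex["messages"][-1]["content"])  # embed the gold response
        if all(cosine(emb, k) < threshold for k in kept_embs):
            kept.append(ex)
            kept_embs.append(emb)
    return kept
```

For large datasets an approximate-nearest-neighbor index replaces the inner loop, but the decision rule is the same.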

Training with KAITO on AKS

KAITO supports QLoRA fine-tuning jobs via a Workspace CRD with inference: false and a tuning spec:

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: finetune-mistral-7b
  namespace: inference
spec:
  resource:
    instanceType: "Standard_NC24ads_A100_v4"  # A100 80GB for training
    labelSelector:
      matchLabels:
        apps: mistral-7b-finetune
  tuning:
    preset:
      name: mistral-7b-v0.3
    method: qlora
    input:
      urls:
        - "https://<storage>.blob.core.windows.net/training/dataset.jsonl"  # Workload Identity auth
    output:
      image: "<your-acr>.azurecr.io/mistral-7b-support:v1"
      imagePushSecret: acr-push-secret
    config:
      # LoRA hyperparameters
      lora_rank: 16
      lora_alpha: 32
      lora_dropout: 0.05
      target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]
      # Training config
      num_epochs: 3
      per_device_train_batch_size: 4
      gradient_accumulation_steps: 4  # effective batch = 16
      learning_rate: 2.0e-4
      warmup_ratio: 0.03
      lr_scheduler_type: "cosine"
      # Memory optimization
      gradient_checkpointing: true
      bf16: true  # A100 supports bfloat16
      max_seq_length: 2048

LoRA hyperparameter guidance:

  • lora_rank: Start at 16. Increase to 32–64 if quality is poor; higher rank = more expressiveness but more parameters.
  • lora_alpha: Set to 2× lora_rank as a starting point. Controls the magnitude of the LoRA update.
  • target_modules: For most transformer models, ["q_proj", "v_proj"] is the minimum. Adding k_proj, o_proj, and the MLP layers (gate_proj, up_proj, down_proj) increases quality at the cost of more parameters.
  • learning_rate: 1e-4 to 3e-4 for QLoRA. Higher than standard fine-tuning because you’re training fewer parameters.
  • num_epochs: 2–5. Monitor validation loss — if it starts rising, stop early.

Evaluating the fine-tuned model

Never deploy a fine-tuned model based on training loss alone. Training loss measures fit to the training set, not generalization or task quality.

Evaluation pipeline:

def evaluate_fine_tuned_model(
    base_model_client,
    finetuned_model_client,
    eval_dataset: list[dict],
) -> dict:
    """Run eval on both models, compare on quality and format compliance."""
    results = {"base": [], "finetuned": []}
    for example in eval_dataset:
        for name, client in [("base", base_model_client), ("finetuned", finetuned_model_client)]:
            response = client.chat.completions.create(
                messages=example["messages"][:-1],  # exclude gold response
                max_tokens=512,
                temperature=0,
            )
            output = response.choices[0].message.content
            results[name].append({
                "output": output,
                "latency_ms": response.usage.completion_tokens * 30,  # rough estimate
                "format_valid": check_format(output, example["expected_format"]),
                "judge_score": llm_judge(example["messages"][-2]["content"], output),
            })
    return {
        "base_pass_rate": mean(r["format_valid"] for r in results["base"]),
        "finetuned_pass_rate": mean(r["format_valid"] for r in results["finetuned"]),
        "base_quality": mean(r["judge_score"] for r in results["base"]),
        "finetuned_quality": mean(r["judge_score"] for r in results["finetuned"]),
        "quality_delta": mean(r["judge_score"] for r in results["finetuned"])
                         - mean(r["judge_score"] for r in results["base"]),
    }

Promotion gate: deploy the fine-tuned model only if:

  • quality_delta > 0.10 (≥ 10% quality improvement)
  • finetuned_pass_rate > 0.95 (95% format compliance)
  • p95 latency ≤ SLO (fine-tuning doesn’t change model size, but verify)
  • No regression on held-out adversarial/safety examples
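
The four criteria above can be encoded as a single gate function over the eval output (key names mirror the evaluation pipeline above; slo_p95_ms and safety_regressions are hypothetical inputs fed from your latency benchmark and adversarial suite):

```python
def promotion_gate(eval_result: dict, p95_latency_ms: float,
                   slo_p95_ms: float, safety_regressions: int) -> bool:
    """Return True only if every promotion criterion passes."""
    return (
        eval_result["quality_delta"] > 0.10            # >= 10% quality improvement
        and eval_result["finetuned_pass_rate"] > 0.95  # 95% format compliance
        and p95_latency_ms <= slo_p95_ms               # latency within SLO
        and safety_regressions == 0                    # no adversarial regressions
    )
```

Wiring this into CI makes the promotion decision reproducible rather than a judgment call at deploy time.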

Cost Optimization

GPU inference is expensive. The three levers are: run the smallest adequate model, reduce token count, and avoid redundant computation.

The actual cost model

Cost per request = (prompt_tokens + completion_tokens) × $/token
                 = prompt_tokens × (GPU $/hr) / (prompt_throughput tok/s × 3600)
                 + completion_tokens × (GPU $/hr) / (generation_throughput tok/s × 3600)

Completion tokens cost significantly more than prompt tokens because generation is sequential (one token per forward pass), while prompts can be processed in parallel. On a T4 with Phi-4 Mini:

  • Prompt processing: ~15,000 tokens/second (parallel)
  • Generation: ~3,000 tokens/second (sequential)

A request with 500 prompt tokens + 300 completion tokens:

Prompt cost: 500 / 15,000 × $0.53/hr / 3,600 = $0.0000049
Generation: 300 / 3,000 × $0.53/hr / 3,600 = $0.0000147
Total: ~$0.000020 per request

At 50,000 requests/day: $1.00/day in GPU time. The system node pool is $8.88/day. At this volume, infrastructure cost dominates.
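
The arithmetic above generalizes to a small helper (the throughput and price figures are the T4/Phi-4 Mini numbers from this section; treat them as measured inputs, not constants):

```python
def cost_per_request(prompt_tokens: int, completion_tokens: int,
                     gpu_dollars_per_hour: float,
                     prompt_tps: float, gen_tps: float) -> float:
    """Dollar cost of one request as GPU-seconds consumed, priced per second."""
    dollars_per_second = gpu_dollars_per_hour / 3600
    prompt_cost = prompt_tokens / prompt_tps * dollars_per_second
    gen_cost = completion_tokens / gen_tps * dollars_per_second
    return prompt_cost + gen_cost

c = cost_per_request(500, 300, 0.53, 15_000, 3_000)
print(f"${c:.6f} per request")                   # $0.000020 per request
print(f"${c * 50_000:.2f}/day at 50k requests")  # $0.98/day at 50k requests
```

Plugging in your own measured throughput is the fastest way to see which of the three levers matters for your workload.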

Prefix caching — the highest-impact optimization

If multiple requests share the same system prompt or conversation prefix, vLLM can reuse the KV cache for those tokens instead of recomputing them. This is called automatic prefix caching (APC).

Enable it for free:

# In manifests/vllm/vllm-standalone.yaml
args:
  - --enable-prefix-caching

Impact: for a chatbot with a 500-token system prompt, every second-and-beyond turn in the conversation reuses those 500 tokens from cache. At 10 turns per session and 10,000 sessions/day, this eliminates 45M token computations per day — roughly 4× the GPU throughput for the same hardware.

Measuring cache effectiveness:

# Cache hit rate — should be > 50% for chatbot use cases
vllm:cache_config_info{namespace="inference"}
rate(vllm:request_success_total{finished_reason="length", namespace="inference"}[5m])

Monitor the hit rate. If it’s below 20% for a chatbot use case, check that your system prompt is truly identical across requests (whitespace differences break cache matches).
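
Since whitespace differences break prefix-cache matches, it helps to canonicalize the system prompt at the one place it is assembled, so every request sends a byte-identical prefix. A minimal sketch:

```python
import textwrap

def canonical_system_prompt(raw: str) -> str:
    """Normalize indentation and trailing whitespace so the cached prefix is stable."""
    dedented = textwrap.dedent(raw)
    lines = [line.rstrip() for line in dedented.strip().splitlines()]
    return "\n".join(lines)

# Both spellings normalize to the same bytes:
a = canonical_system_prompt("You are a support agent.\n  Be concise.  \n")
b = canonical_system_prompt("""
    You are a support agent.
      Be concise.
""")
```

Running every system prompt through one such function is cheaper than debugging a silently cold cache.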

Exact and semantic caching

Exact caching (Redis) — for repeated identical queries:

import hashlib
import json
import redis

cache = redis.Redis(host="redis.inference.svc.cluster.local", port=6379)
CACHE_TTL = 3600  # 1 hour

def cached_inference(messages: list, model: str, **kwargs) -> str:
    cache_key = hashlib.sha256(
        json.dumps({"messages": messages, "model": model}).encode()
    ).hexdigest()
    if cached := cache.get(cache_key):
        return json.loads(cached)
    response = llm_client.chat.completions.create(
        messages=messages, model=model, **kwargs
    )
    result = response.choices[0].message.content
    cache.setex(cache_key, CACHE_TTL, json.dumps(result))
    return result

Best for: FAQ bots, documentation queries, classification tasks where users ask the same things repeatedly.

Semantic caching — for near-duplicate queries:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.threshold = similarity_threshold
        self.cache: list[tuple[np.ndarray, str, str]] = []  # (embedding, query, response)

    def get(self, query: str) -> str | None:
        query_emb = np.array(embed(query))
        for stored_emb, stored_query, stored_response in self.cache:
            sim = cosine_similarity([query_emb], [stored_emb])[0][0]
            if sim >= self.threshold:
                return stored_response
        return None

    def set(self, query: str, response: str):
        self.cache.append((np.array(embed(query)), query, response))

Important caveat: semantic caching introduces latency for the embedding call (10–50ms). Only worthwhile if your inference latency is high (> 500ms) and your query repetition rate is high (> 30%). Measure before deploying.
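
The repetition rate can be measured offline against a day of logged queries before deploying anything. A sketch (the 0.95 threshold matches the cache above; you would feed it embeddings from whatever embedding function you already run):

```python
import math

def repetition_rate(embeddings: list[list[float]], threshold: float = 0.95) -> float:
    """Fraction of queries that would have hit a semantic cache warmed by
    the queries that preceded them (greedy, order-preserving)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    seen, hits = [], 0
    for emb in embeddings:
        if any(cos(emb, s) >= threshold for s in seen):
            hits += 1
        else:
            seen.append(emb)
    return hits / len(embeddings) if embeddings else 0.0
```

If the result is below the ~30% bar from the caveat above, skip the semantic cache entirely.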

Model cascade — route by task complexity

Not every request needs your most capable model. A model cascade routes simple requests to a cheap fast model and complex requests to a powerful model.

ROUTING_PROMPT = """Classify this request's complexity:
- "simple": factual lookup, yes/no, short answer, format conversion
- "complex": multi-step reasoning, code generation, analysis, comparison
Request: {query}
Respond with only "simple" or "complex"."""

def cascade_route(query: str, context: str) -> str:
    # Use a tiny fast model to classify request complexity
    complexity = phi4_mini_client.chat.completions.create(
        messages=[{"role": "user", "content": ROUTING_PROMPT.format(query=query)}],
        max_tokens=5,
        temperature=0,
    ).choices[0].message.content.strip().lower()
    if complexity == "simple":
        client = phi4_mini_client  # T4, ~$0.53/hr
    else:
        client = llama70b_client  # 2× A100, ~$7.34/hr
    return client.chat.completions.create(
        messages=[{"role": "system", "content": context},
                  {"role": "user", "content": query}],
        max_tokens=512,
    ).choices[0].message.content

The classifier call (Phi-4 Mini) costs ~$0.000002. If 70% of queries are “simple” and the complex model costs 30× more, the cascade saves ~50% on inference cost with negligible quality degradation on the simple tier.

Prompt compression

Long prompts are expensive to process and take VRAM for KV cache. For RAG use cases where you’re injecting large retrieval contexts, consider compressing the context before sending it to the model.

LLMLingua strips tokens from the prompt that don’t contribute to the answer while preserving the information needed to answer the query:

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)

def compress_context(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)
    result = compressor.compress_prompt(
        context,
        question=question,
        target_token=512,  # compress to 512 tokens regardless of input size
        condition_in_question="after_condition",
        rank_method="llmlingua",
    )
    return result["compressed_prompt"]

LLMLingua achieves 3–5× compression with < 5% quality degradation for most RAG tasks. At 2,000 tokens of retrieved context compressed to 512 tokens, you’ve reduced KV cache and TTFT by 75%.

CI/CD for LLM Changes

Three things change in an LLM system, and each requires a different pipeline:

Change type | Risk | Pipeline
Prompt update | Medium — subtle quality regressions, behavior drift | Eval → review → canary
Model version upgrade | High — full behavior change, capability regression possible | Full benchmark → blue-green
RAG pipeline change | Medium-high — retrieval quality change silently degrades answers | RAGAS eval → traffic sample comparison

Prompt CI/CD pipeline

# .github/workflows/prompt-eval.yml
name: Prompt Eval
on:
  pull_request:
    paths:
      - "prompts/**"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Promptfoo evals
        run: npx promptfoo eval --config promptfooconfig.yaml --output results.json
        env:
          AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
          VLLM_ENDPOINT: ${{ secrets.VLLM_ENDPOINT }}
      - name: Parse results and gate
        run: |
          PASS_RATE=$(jq '.results.stats.successes / .results.stats.total' results.json)
          AVG_SCORE=$(jq '.results.stats.assertPassCount / .results.stats.assertCount' results.json)
          echo "Pass rate: $PASS_RATE"
          echo "Avg score: $AVG_SCORE"
          # Gate: 95% pass rate, 0.80 assertion score (4.0 on a 5-point scale)
          if (( $(echo "$PASS_RATE < 0.95" | bc -l) )); then
            echo "::error::Pass rate $PASS_RATE below 0.95 threshold"
            exit 1
          fi
          if (( $(echo "$AVG_SCORE < 0.80" | bc -l) )); then
            echo "::error::Avg score $AVG_SCORE below 0.80 threshold"
            exit 1
          fi
      - name: Comment results on PR
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./results.json');
            const body = `## Eval Results\n\n` +
              `Pass rate: ${(results.results.stats.successes / results.results.stats.total * 100).toFixed(1)}%\n` +
              `Failed cases: ${results.results.stats.failures}\n\n` +
              `[Full results artifact](${process.env.GITHUB_SERVER_URL}/${process.env.GITHUB_REPOSITORY}/actions/runs/${process.env.GITHUB_RUN_ID})`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body,
            });

Canary rollout with per-version metrics:

<!-- APIM inbound: A/B split with version tracking -->
<set-variable name="promptVersion"
    value="@(new Random().Next(100) &lt; 10 ? &quot;v2&quot; : &quot;v1&quot;)" />
<choose>
  <when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;promptVersion&quot;) == &quot;v2&quot;)">
    <!-- Route to backend that loads v2 prompt -->
    <set-header name="X-Prompt-Version" exists-action="override">
      <value>v2</value>
    </set-header>
  </when>
</choose>
<!-- Always emit version dimension for comparison in Langfuse / App Insights -->
<set-header name="X-Prompt-Version-Actual" exists-action="override">
  <value>@(context.Variables.GetValueOrDefault&lt;string&gt;("promptVersion"))</value>
</set-header>

Track quality_score by prompt_version in Langfuse for at least 200 samples before declaring v2 the winner and rolling to 100%.
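
Whether 200 samples are actually enough depends on the effect size. A rough two-sample check on the collected scores, stdlib only (a sketch, not a substitute for a proper power analysis):

```python
import math
from statistics import mean, stdev

def version_delta(scores_v1: list[float], scores_v2: list[float]) -> tuple[float, float]:
    """Return (mean delta, approximate 95% CI half-width) for v2 - v1."""
    d = mean(scores_v2) - mean(scores_v1)
    se = math.sqrt(stdev(scores_v1) ** 2 / len(scores_v1)
                   + stdev(scores_v2) ** 2 / len(scores_v2))
    return d, 1.96 * se

# Promote v2 only if the whole interval sits above zero:
# delta, hw = version_delta(v1_scores, v2_scores)
# promote = delta - hw > 0
```

If the interval straddles zero at 200 samples, keep the canary running rather than declaring a winner.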

Model upgrade pipeline

Model upgrades carry more risk than prompt changes — every behavior can shift.

1. Update model reference in staging deployment
2. Run full eval suite against staging (all golden datasets)
3. Run adversarial test suite (jailbreaks, injection attempts, refusal cases)
4. Run latency benchmark (TTFT, TPOT, throughput at target concurrency)
5. Human review of 20 randomly sampled outputs on complex cases
6. If all gates pass → blue-green deploy to production
7. Monitor for 48 hours at 5% traffic before full cutover
8. Rollback trigger: any SLO breach or quality drop > 5% in online eval

Blue-green for model upgrades — keep the old model running during transition:

# Two deployments; an HTTPRoute splits traffic between their services
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-phi4-v1
  namespace: inference
spec:
  selector:
    matchLabels:
      app: vllm
      version: phi4-v1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-phi4-v2
  namespace: inference
spec:
  selector:
    matchLabels:
      app: vllm
      version: phi4-v2
---
# HTTPRoute: route 5% to v2
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  rules:
    - backendRefs:
        - name: vllm-phi4-v1-svc
          weight: 95
        - name: vllm-phi4-v2-svc
          weight: 5

RAG pipeline changes

Changing chunking strategy or embedding model requires re-indexing the entire corpus — a batch job, not a rolling deployment. Track which index version is active:

INDEX_VERSION = "v3-chunk512-bge-m3"  # bump this on any pipeline change

def index_document(doc: str, source: str):
    chunks = splitter.split_text(doc)
    embeddings = embed_batch(chunks)
    search_client.upload_documents([
        {
            "id": f"{source}-{i}-{INDEX_VERSION}",
            "content": chunk,
            "content_vector": emb,
            "index_version": INDEX_VERSION,
            "source": source,
        }
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
    ])

Run RAGAS against both the old and new index before switching the production pointer. A 5% drop in context recall is a rollback signal.

Production Failure Modes

The cold-start problem

When KEDA scales from 0 replicas to 1 and NAP provisions a new GPU node, there is a 3–8 minute gap before the first request can be served:

KEDA scale trigger fires
↓ ~10s
New pod scheduled, NAP provisions node
↓ ~2-4 min
Node joins cluster, GPU driver initializes
↓ ~1-2 min
Pod starts, model weights loaded into VRAM
↓ ~1-2 min (Phi-4 Mini) to ~8 min (Llama 70B)
First request served

Mitigation strategies:

  • Keep minReplicaCount: 1 — one warm pod avoids the full cold start. The pod costs GPU time even at idle, but eliminates the provision delay.
  • Use predictive scaling — if your traffic has daily patterns (business hours peak), pre-scale 15 minutes before expected demand.
  • Implement a queue buffer — when all pods are busy and no warm pod exists, return HTTP 202 with a queue position rather than timing out. The client polls for completion.
# KEDA ScaledObject: scale-to-zero for maximum cost savings.
# Use only for workloads where the cold start is tolerable (batch jobs, async APIs);
# set minReplicaCount: 1 to keep one warm pod for latency-sensitive traffic.
minReplicaCount: 0
scaleTargetRef:
  apiVersion: apps/v1
  kind: Deployment
  name: vllm-standalone
triggers:
  - type: prometheus
    metadata:
      query: avg(vllm:num_requests_waiting{namespace="inference"})
      threshold: "1"
      activationThreshold: "1"
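
The queue-buffer mitigation above (HTTP 202 with a queue position) can be sketched as an in-memory ticket table; a real deployment would back this with shared storage and expiry, and every name here is illustrative:

```python
import uuid
from collections import OrderedDict

class InferenceQueue:
    """Accept requests during cold start; clients poll by ticket id."""
    def __init__(self):
        self.pending: OrderedDict[str, dict] = OrderedDict()
        self.done: dict[str, str] = {}

    def submit(self, payload: dict) -> dict:
        ticket = str(uuid.uuid4())
        self.pending[ticket] = payload
        # HTTP 202 semantics: accepted, not yet processed
        return {"status": 202, "ticket": ticket, "position": len(self.pending)}

    def poll(self, ticket: str) -> dict:
        if ticket in self.done:
            return {"status": 200, "result": self.done[ticket]}
        position = list(self.pending).index(ticket) + 1
        return {"status": 202, "position": position}

    def complete_next(self, run_inference) -> None:
        ticket, payload = self.pending.popitem(last=False)  # FIFO
        self.done[ticket] = run_inference(payload)
```

The client retries poll() with backoff until it sees a 200, instead of holding a request open through the 3–8 minute node provision.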

KV cache exhaustion under load

When vllm:gpu_cache_usage_perc approaches 100%, the scheduler starts preempting (evicting) in-progress sequences to make room for new ones. Preempted sequences must restart their prefill from scratch — this causes sudden TTFT spikes under high load.

Symptoms: TTFT spikes 3–10× at load that’s well within your max-num-seqs limit, with gpu_cache_usage_perc at 95%+.

Diagnosis:

# Watch KV cache in real time
kubectl exec -n inference <vllm-pod> -- curl -s http://localhost:8000/metrics \
| grep "gpu_cache_usage"

Fix (in order):

  1. Reduce max-num-seqs — you’re scheduling more concurrent sequences than the KV cache can hold
  2. Enable --kv-cache-dtype fp8 — halves KV cache memory on A100/H100
  3. Reduce max-model-len — each sequence reserves less KV space
  4. Add replicas and reduce max-num-seqs per pod proportionally

Conversation quality degradation at depth

Multi-turn conversations degrade because the model gives less attention weight to the system prompt as the context fills up. This is a known limitation of the attention mechanism, not a bug.

Signals:

  • User satisfaction scores drop after conversation turn 5–7
  • The model starts ignoring format instructions it followed in early turns
  • Langfuse traces show identical queries producing different quality scores based on conversation depth

Monitor it:

# In Langfuse, tag traces with turn number
langfuse_context.update_current_trace(
    metadata={"conversation_turn": turn_number, "context_tokens": current_context_tokens}
)

Query: avg(quality_score) group by conversation_turn — a drop after turn 5 confirms the pattern.

Fix: implement progressive summarization (Section 3.5) or a fixed sliding window over conversation history.
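
The sliding-window fix can be as simple as keeping the system prompt pinned and dropping the oldest turns that no longer fit. A sketch, where count_tokens stands in for your tokenizer and the budget is an assumption to size against your model's context window:

```python
def sliding_window(messages: list[dict], count_tokens, budget: int = 2048) -> list[dict]:
    """Keep the system prompt plus the most recent turns that fit the token budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(count_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    for m in reversed(rest):  # walk newest-first
        cost = count_tokens(m["content"])
        if used + cost > budget:
            break
        kept.append(m)
        used += cost
    return system + list(reversed(kept))
```

Truncating at message boundaries (rather than mid-message) keeps the remaining turns coherent for the model.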

Streaming connection accumulation

Streaming inference responses (Server-Sent Events) hold an open HTTP connection for the duration of the completion. A client that opens a streaming connection and never closes it holds a concurrency slot for the full requestTimeout.

Symptom: vllm:num_requests_running stays high even as traffic drops. GPU utilization is low but the scheduler reports maximum concurrency. New requests queue even though the GPU is largely idle.

Fix:

  1. Set requestTimeout on the Envoy Gateway BackendTrafficPolicy (see ingress-guide.md)
  2. Set a read timeout in your application client:

response = client.chat.completions.create(
    messages=...,
    stream=True,
    timeout=httpx.Timeout(connect=5, read=120, write=10, pool=5),
)

  3. Add a heartbeat check — if a streaming connection has produced no tokens in 30 seconds, close it from the client side.
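
The heartbeat check can be implemented as a wrapper around the token iterator. A sketch with a pluggable clock (the 30-second threshold matches the text; note a fully dead connection still needs the read timeout above, since this only measures gaps between tokens that do arrive):

```python
import time

class StreamStalled(Exception):
    pass

def with_heartbeat(token_iter, max_gap_s: float = 30.0, clock=time.monotonic):
    """Yield tokens, raising StreamStalled if the gap between tokens exceeds max_gap_s."""
    last = clock()
    for token in token_iter:
        now = clock()
        if now - last > max_gap_s:
            raise StreamStalled(f"no tokens for {now - last:.1f}s")
        last = now
        yield token
```

The caller catches StreamStalled, closes the connection, and frees the concurrency slot instead of waiting out the full request timeout.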

Feedback and Continuous Improvement

Production traffic is your most valuable training signal. Every user interaction is a labeled example if you instrument it correctly.

Signal collection

Explicit feedback — integrate directly into your UI:

# When user rates a response
langfuse_client.score(
    trace_id=trace_id,
    name="user_rating",
    value=rating,  # 1-5 or 0/1 thumbs
    comment=user_comment,
)

Implicit feedback — infer quality from behavior:

Behavior | Quality signal | How to measure
User re-prompts immediately | Bad answer | Time between response and next user message < 10s
Session abandonment after response | Bad answer | Session ends within 30s of a response
User copies response | Good answer | Clipboard event or UI copy button click
Escalation to human agent | Failed answer | Route change event
User continues conversation | Neutral to good | Any follow-up message
# Log implicit signals as Langfuse scores
def on_user_reprompt(trace_id: str, seconds_since_response: float):
    if seconds_since_response < 10:
        langfuse_client.score(
            trace_id=trace_id,
            name="implicit_quality",
            value=0,
            comment=f"Re-prompted after {seconds_since_response:.1f}s",
        )

Labeling infrastructure

Raw feedback signals need human review before entering a training dataset. Deploy Argilla on AKS for annotation workflows:

helm repo add argilla https://argilla-io.github.io/argilla
helm upgrade --install argilla argilla/argilla \
  --namespace argilla --create-namespace \
  --set replicaCount=1 \
  --set resources.requests.memory="1Gi"

Labeling pipeline:

import argilla as rg

# Initialize Argilla connection
rg.init(api_url="http://argilla.argilla.svc.cluster.local:6900", api_key=...)

# Push low-rated traces to Argilla for review
def export_low_quality_traces(min_date: datetime, max_traces: int = 200):
    low_quality = langfuse_client.fetch_traces(
        tags=["production"],
        min_score={"name": "user_rating", "op": "lt", "value": 3},
        limit=max_traces,
    )
    records = [
        rg.TextClassificationRecord(
            text=trace.input["messages"][-1]["content"],
            prediction=[("bad_answer", 1.0)],
            annotation=None,  # labeler will fill this in
            metadata={
                "trace_id": trace.id,
                "model": trace.metadata.get("model"),
                "full_context": json.dumps(trace.input),
                "response": trace.output,
            },
        )
        for trace in low_quality.data
    ]
    rg.log(records, name="low_quality_traces", workspace="production-review")

What annotators should do: for each flagged trace, determine whether the issue is (a) wrong answer → add to fine-tuning dataset with a corrected response, (b) format violation → update prompt, (c) missing context → improve RAG retrieval, or (d) legitimate limitation → not fixable without model upgrade.

The improvement decision tree

When evaluations show a quality gap, the fix depends on the failure mode:

Quality gap observed
├─ Wrong format / style?
│ → Fix the prompt (system prompt + examples)
│ → If persistent after prompt fix → fine-tune on format
├─ Factually wrong on domain knowledge?
│ ├─ Knowledge available in documents?
│ │ → RAG: add to index, improve chunking/retrieval
│ └─ Knowledge not in documents?
│ → Fine-tune if static knowledge; accept limitation if dynamic
├─ Wrong answer on complex reasoning?
│ ├─ Passes on larger model (GPT-4o / Llama 70B)?
│ │ → Either use larger model OR fine-tune smaller model on CoT examples
│ └─ Fails on all models?
│ → Task may be inherently ambiguous; clarify requirements
├─ Inconsistent behavior (same query, different answers)?
│ → Lower temperature; use CoT; add few-shot examples; fine-tune
└─ Safety/refusal failures?
→ Add to adversarial test suite; fix system prompt; fine-tune on refusals

Detecting model drift

Unlike ML models, LLMs don’t drift due to distribution shift in inputs (they’re not trained on your data). They drift when:

  • The model is upgraded (vLLM image tag changes) — use a pinned model version
  • The underlying base model checkpoint is updated on HuggingFace — pin to a specific revision
  • Your prompt changes affect capabilities you didn’t test

Run your golden-dataset eval weekly against the production endpoint, not just at deploy time:

# .github/workflows/weekly-drift-check.yml
on:
  schedule:
    - cron: '0 6 * * 1'  # every Monday at 6am UTC
jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - name: Run eval against production endpoint
        run: |
          npx promptfoo eval \
            --config promptfooconfig.yaml \
            --providers vllm:http://inference-prod.yourdomain.com/v1 \
            --output weekly-results.json
      - name: Compare to baseline
        run: |
          python scripts/compare_eval_results.py \
            --baseline eval-baselines/latest.json \
            --current weekly-results.json \
            --threshold 0.05  # alert if pass rate drops > 5%

PTU vs. Consumption for the Fallback Path

When your architecture uses Azure OpenAI as a fallback (vLLM primary → Azure OpenAI on overload), the billing model for the fallback affects your cost floor.

Consumption — pay per token, shared capacity, subject to throttling. The right choice when traffic is unpredictable or low.

PTU — reserved capacity, guaranteed throughput, billed per hour. The right choice when you can predict traffic volume and the volume justifies the reservation.

Break-even: PTU is cheaper once you exceed ~60–70% utilization of the provisioned throughput. Below that, consumption is cheaper. Use the Azure OpenAI PTU calculator with your measured TPM from production.
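
The break-even point can be sanity-checked with a back-of-the-envelope comparison. All three inputs below are placeholders to fill from your own pricing sheet and measured TPM, not real Azure prices:

```python
def ptu_breakeven_utilization(ptu_dollars_per_hour: float,
                              ptu_capacity_tpm: float,
                              consumption_dollars_per_1k_tokens: float) -> float:
    """Utilization fraction above which the PTU reservation beats per-token billing."""
    tokens_per_hour_at_full = ptu_capacity_tpm * 60
    consumption_cost_at_full = tokens_per_hour_at_full / 1000 * consumption_dollars_per_1k_tokens
    return ptu_dollars_per_hour / consumption_cost_at_full

# Example with made-up prices: $100/hr reservation, 50k TPM capacity, $0.05/1k tokens
breakeven = ptu_breakeven_utilization(100.0, 50_000, 0.05)
```

If your measured utilization of the provisioned throughput stays above the returned fraction, reserve; otherwise stay on consumption.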

Hybrid strategy: PTU as primary guaranteed capacity + consumption as overflow:

<!-- APIM: PTU primary, consumption overflow on 429 -->
<retry condition="@(context.Response.StatusCode == 429)" count="1" interval="0">
<set-backend-service base-url="{{aoai-consumption-endpoint}}" />
<set-header name="api-key" exists-action="override">
<value>{{aoai-consumption-key}}</value>
</set-header>
</retry>

The interval="0" is intentional for a backend switch (no wait needed — you’re routing to a different endpoint, not retrying the same one). Do not set interval > 0 for the consumption fallback — you want immediate rerouting, not a backoff.

Recommended Stack

This stack is opinionated for the AKS-on-Azure context of this lab. Every component has a clear scope; no two overlap.

Layer | Tool | Hosting | Scope
Edge / WAF | Azure Front Door Premium | Managed | DDoS, WAF, global routing
API gateway | Azure API Management | Managed | Auth, token rate limiting, cost attribution, fallback routing
Inference engine | vLLM | AKS GPU pool | High-throughput serving, prefix caching, continuous batching
Autoscaling | KEDA + NAP | AKS | Request-driven scale, GPU node lifecycle
Application tracing | Langfuse | AKS (self-hosted) | Per-request traces, quality scores, cost attribution by user
System metrics | Azure Managed Prometheus + Grafana | Managed | vLLM metrics, GPU utilization, KEDA queue depth
Evals (offline) | Promptfoo | CI (GitHub Actions) | Pre-deploy quality gate on prompt/model changes
Evals (online) | Langfuse scores + LLM-as-judge | AKS / CI | Continuous quality monitoring in production
RAG retrieval | Azure AI Search | Managed | Hybrid search (BM25 + vector), semantic ranking
Guardrails | Azure AI Content Safety | Managed | Prompt Shield (input) + harm detection (output) via APIM
Labeling | Argilla | AKS | Annotation workflow for fine-tuning datasets
Fine-tuning | KAITO QLoRA jobs | AKS GPU pool | LoRA/QLoRA training on A100 nodes
Model registry | Azure Container Registry | Managed | Fine-tuned model images, digest-pinned
