
When you call Azure OpenAI or the OpenAI API, most of the operational surface disappears: Microsoft or OpenAI manages the GPU, the model weights, the inference runtime, and the content filters. Your ops surface is prompts, evals, and cost control.
Self-hosted LLMOps is what remains when you take all of that back. You own the GPU lifecycle, the model serving process, the scaling logic, the guardrails, the observability pipeline, and the feedback loop that improves quality over time. The tradeoffs that make self-hosting worth it — data sovereignty, cost at volume, no vendor lock-in, full control over serving parameters — come with a proportional operational commitment.
LLMOps borrows MLOps vocabulary, but the failure modes are different. An ML model fails silently when its distribution drifts. An LLM fails loudly — with confident nonsense, injected instructions, hallucinated citations, or a 30-second response time that breaks your frontend timeout. The operational discipline has to match the failure mode.
This guide covers the full lifecycle: observability at each layer of the stack, evaluation design, prompt engineering, RAG, fine-tuning, cost optimization, CI/CD, and the feedback loops that close the improvement cycle.
The implementations use AKS — KAITO, NAP, KEDA, APIM, Workload Identity as discussed on https://rtrentinsworld.com/2026/03/27/running-llm-inference-on-aks/ — but the operational patterns apply to any self-hosted inference deployment.
Observability
Inference observability operates at two distinct layers that require different tools and answer different questions.
System layer — what is the GPU doing? Is the KV cache full? Are requests queueing? This is answered by vLLM’s Prometheus metrics, surfaced in the lab’s Azure Managed Grafana.
Application layer — which prompt produced a bad answer? Which user session had high latency? What was the token distribution across requests? This is answered by a tracing tool like Langfuse that captures the semantic content of each call.
You need both. System metrics tell you the machine is sick; application traces tell you which patient is suffering.
Latency has three components — measure all three
Most teams measure only end-to-end response time and miss two diagnostically distinct signals:
| Metric | Definition | What causes it | SLO target |
|---|---|---|---|
| TTFT — Time to First Token | Wall clock from request send to first token received | Prefill phase: processing the entire input prompt | < 500ms for chat, < 3s for long-context RAG |
| TPOT — Time Per Output Token | Average milliseconds per generated token after the first | Decode phase: GPU throughput | < 30ms/token for real-time chat |
| E2E latency | Total request time | TTFT + (completion_tokens × TPOT) + network | Function of both above + payload size |
Why this matters: TTFT and TPOT have different root causes and different fixes. High TTFT means your prefill is too long (large context, no prefix caching, or the scheduler is overwhelmed). High TPOT means your GPU is undersized or oversubscribed. Measuring only E2E hides which knob to turn.
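To see the split in practice, instrument your client to timestamp each streamed chunk. A minimal sketch follows — pure timing math with no client dependency, so it can be wired to whatever streaming SDK you use (the function name and field names are my own, not from any library):

```python
def latency_breakdown(request_start: float, token_times: list[float]) -> dict:
    """Split a streamed response into TTFT, TPOT, and E2E latency.

    request_start: wall-clock time the request was sent
    token_times:   wall-clock arrival time of each generated token
    """
    ttft = token_times[0] - request_start
    # TPOT averages the inter-token gaps AFTER the first token,
    # so it isolates decode throughput from prefill cost
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    e2e = token_times[-1] - request_start
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": e2e}
```

In practice, record `time.monotonic()` when the request is sent and again on each chunk of the streaming response, then feed the two into this function.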
vLLM metrics — the essential set
vLLM exposes a Prometheus endpoint at /metrics. With the lab’s Azure Managed Prometheus scraping vLLM pods, these queries go directly into Grafana.
Request queue health:
```promql
# Requests waiting for a GPU slot — the primary scaling signal
vllm:num_requests_waiting{namespace="inference"}

# Running sequences — are we at max-num-seqs capacity?
vllm:num_requests_running{namespace="inference"}
```
When num_requests_waiting is consistently > 0, you have more demand than GPU capacity. KEDA should be watching this.
KV cache utilization — the throughput governor:
```promql
# KV cache fill rate — aim for 70-85% at peak, not 95%+
vllm:gpu_cache_usage_perc{namespace="inference"}
```
At 95%+ KV cache utilization, vLLM starts evicting blocks from queued sequences. This causes TTFT spikes as prefills get re-processed. Size max-num-seqs so you hit 80-85% at expected peak, not at maximum concurrency.
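A back-of-envelope way to size this: compute the KV bytes per token from the model architecture and divide the leftover VRAM by the per-sequence cost. A sketch with hypothetical dimensions — read your model's actual layer and head counts from its `config.json` before trusting the numbers:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache cost per token: 2 (K and V) x layers x kv_heads x head_dim x bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_full_length_seqs(free_vram_bytes: int, max_model_len: int, **model_dims) -> int:
    """How many sequences at full context length fit in the KV cache budget."""
    per_seq = kv_bytes_per_token(**model_dims) * max_model_len
    return free_vram_bytes // per_seq

# Hypothetical dimensions, NOT a specific model's config
dims = dict(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 4 * 1024**3  # e.g. 4 GB of VRAM left after loading weights
print(max_full_length_seqs(budget, max_model_len=4096, **dims))  # → 8
```

Real concurrency is usually higher because most sequences never reach `max-model-len`, which is exactly why vLLM allocates KV blocks lazily — this bound is the conservative floor.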
Token throughput — your cost efficiency signal:
```promql
# Tokens generated per second (decode throughput)
rate(vllm:generation_tokens_total{namespace="inference"}[5m])

# Tokens processed in prefill per second
rate(vllm:prompt_tokens_total{namespace="inference"}[5m])
```
A healthy vLLM pod on a T4 with Phi-4 Mini should sustain 2,000–4,000 generation tokens/second. If you’re seeing 500 tokens/second at moderate load, the GPU is undersized or there’s a scheduling pathology.
Latency percentiles — for SLO compliance:
```promql
# p95 time to first token
histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket{namespace="inference"}[10m]))

# p95 time per output token
histogram_quantile(0.95, rate(vllm:time_per_output_token_seconds_bucket{namespace="inference"}[10m]))

# p95 end-to-end latency
histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket{namespace="inference"}[10m]))
```
GPU hardware metrics — from the DCGM exporter (manifests/monitoring/dcgm-exporter.yaml):
```promql
# GPU compute utilization — should be 70-90% under load
DCGM_FI_DEV_GPU_UTIL{namespace="inference"}

# GPU memory used vs total
DCGM_FI_DEV_FB_USED{namespace="inference"} / DCGM_FI_DEV_FB_TOTAL{namespace="inference"}

# GPU temperature — alert above 85°C
DCGM_FI_DEV_GPU_TEMP{namespace="inference"}
```
Low GPU utilization (< 40%) at peak load means the GPU is waiting on something — likely the KV scheduler, CPU tokenization, or a max-num-seqs ceiling that’s too low. High utilization (> 95%) with growing request queues means you need more replicas.
APIM metrics — cost attribution at the consumer level
APIM’s Application Insights integration provides the token attribution data that vLLM doesn’t have: which consumer is sending how many tokens.
Configure token emission in the APIM policy outbound section:
```xml
<outbound>
    <base />
    <llm-emit-token-metric namespace="InferenceTokens">
        <dimension name="consumer-id" value="@(context.Subscription.Id)" />
        <dimension name="model" value="@(context.Request.Body.As<JObject>(preserveContent: true)["model"]?.ToString() ?? "unknown")" />
        <dimension name="environment" value="@(context.Deployment.ServiceName.Contains("prod") ? "prod" : "dev")" />
    </llm-emit-token-metric>
</outbound>
```
This feeds a Log Analytics query for monthly chargeback by team:
```kusto
customMetrics
| where name == "InferenceTokens"
| summarize total_tokens = sum(value) by tostring(customDimensions["consumer-id"]), bin(timestamp, 1d)
| order by total_tokens desc
```
Application-level tracing with Langfuse
Langfuse captures the semantic content of each LLM call: which prompt, which response, latency, token counts, and any scores you attach. This is where debugging happens when a user reports a bad answer.
Deploy Langfuse on AKS:
```bash
helm repo add langfuse https://langfuse.github.io/langfuse-k8s
helm upgrade --install langfuse langfuse/langfuse \
  --namespace langfuse --create-namespace \
  --set langfuse.nextauth.secret="$(openssl rand -hex 32)" \
  --set langfuse.salt="$(openssl rand -hex 16)" \
  --set postgresql.enabled=true \
  --set postgresql.auth.database=langfuse \
  --set langfuse.resources.requests.memory="512Mi" \
  --set langfuse.resources.requests.cpu="250m"
```
Route through the cluster’s Envoy Gateway for internal access. For external access, put Langfuse behind APIM with AAD auth — it contains prompt content and should not be publicly accessible.
SDK instrumentation (Python):
```python
import os

from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

langfuse = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ["LANGFUSE_HOST"],  # internal cluster URL
)

@observe()
def generate_response(user_query: str, session_id: str) -> str:
    langfuse_context.update_current_trace(
        user_id=session_id,
        tags=["production", "customer-support"],
    )

    # Retrieve context (if RAG)
    chunks = retrieve(user_query)
    langfuse_context.update_current_observation(
        input={"query": user_query, "chunks_retrieved": len(chunks)},
    )

    # Call vLLM (OpenAI-compatible endpoint)
    response = openai_client.chat.completions.create(
        model="phi-4-mini-instruct",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{chunks}\n\nQuestion: {user_query}"}
        ],
        max_tokens=512,
    )
    output = response.choices[0].message.content

    # Attach quality score if you have one
    langfuse_context.score_current_trace(
        name="groundedness",
        value=check_groundedness(output, chunks),
    )
    return output
```
Trace correlation across APIM and vLLM
Propagate a trace ID from APIM through to the application so a single request is traceable across all layers:
```xml
<!-- APIM inbound: generate and forward trace ID -->
<set-header name="X-Trace-Id" exists-action="skip">
    <value>@(Guid.NewGuid().ToString())</value>
</set-header>
<set-variable name="traceId" value="@(context.Request.Headers.GetValueOrDefault("X-Trace-Id"))" />
```
Your application reads X-Trace-Id from the incoming request and passes it to Langfuse as the trace_id. This lets you correlate a Langfuse trace with APIM logs, vLLM logs, and Kubernetes pod logs using a single ID.
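On the application side this can be one small helper. The sketch below assumes a plain headers dict; how the resulting ID is attached to the trace depends on your Langfuse SDK version:

```python
import uuid

def resolve_trace_id(headers: dict) -> str:
    """Prefer the APIM-propagated X-Trace-Id; mint a new ID only if it is missing."""
    return headers.get("X-Trace-Id") or str(uuid.uuid4())
```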
Evaluation
Evals are the tests for your LLM system. Without them, you cannot safely change a prompt, upgrade a model, or modify a RAG pipeline — you’re deploying blind.
The hardest part is not the tooling. It’s defining what “correct” means for your task, assembling representative test cases, and deciding which failure modes matter most. The tooling is secondary.
What you’re actually testing
The eval surface for an LLM system has three layers, and they require different techniques:
| Layer | What changes | Eval technique |
|---|---|---|
| Prompt | Wording, instructions, examples | Golden dataset comparison |
| Model | Weights, quantization, version | Benchmark regression |
| RAG pipeline | Chunking, retrieval, re-ranking | Retrieval + faithfulness metrics |
Each layer changes at a different frequency. Prompts change most often (weekly in active development). Models change occasionally (when a new version drops). The RAG pipeline changes when the document corpus grows or retrieval quality issues surface.
Building your first golden dataset
The bootstrap problem: you need test cases before you have production data, but the best test cases come from production failures. How to start:
Step 1 — Write 20 cases by hand. Cover the happy path (typical query, good answer), three known edge cases, and two adversarial inputs. Write the expected answer in detail — not “a correct answer” but the specific things it must contain.
Step 2 — Generate synthetic variants. Use GPT-4o or a strong model to paraphrase your 20 cases into 60–80 variants. Prompt: “Generate 4 rephrased versions of this user question that ask the same thing differently.” This gives you coverage without manual effort.
Step 3 — Collect production failures once deployed. Every time a user flags a bad answer (thumbs down, escalation, correction), add it to the dataset. Production failures are worth 10× synthetic cases because they represent real failure modes you didn’t anticipate.
Step 4 — Balance the dataset. Check that your cases cover the full distribution of your real traffic — length, topic, complexity. A dataset of 100 short simple questions will pass with flying colors while the 20% of long complex queries fail in production.
Minimum viable dataset size:
- 50–100 cases: can detect changes ≥ 10% in quality
- 200–500 cases: can detect changes ≥ 5%, meaningful regression testing
- 1,000+ cases: statistical confidence for fine-grained comparison
For a 95% confidence interval with 5% margin of error, you need ~385 test cases. For 2% margin of error, ~2,400. Budget accordingly.
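Those figures come from the standard sample-size formula for a proportion, n = z²·p(1−p)/e², taking worst-case p = 0.5. A quick sketch to reproduce them:

```python
import math

def eval_sample_size(margin_of_error: float, z: float = 1.96, p: float = 0.5) -> int:
    """Test cases needed to estimate a pass rate within +/- margin_of_error.

    Uses n = z^2 * p * (1 - p) / e^2 with p = 0.5 as the worst case
    (maximum variance), z = 1.96 for a 95% confidence interval.
    """
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

print(eval_sample_size(0.05))  # → 385
print(eval_sample_size(0.02))  # ≈ 2,400
```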
Assertion types
Deterministic assertions — for outputs with a known right answer:
```yaml
# promptfooconfig.yaml
tests:
  - vars:
      query: "What port does vLLM listen on by default?"
    assert:
      - type: contains
        value: "8000"
      - type: not-contains
        value: "8080"   # common wrong answer
      - type: javascript
        value: |
          output.length < 200  // reject verbose responses
```
Use deterministic assertions for: factual questions, structured output format, required keywords, output length constraints.
LLM-as-judge — for quality dimensions that can’t be checked with string matching:
```yaml
tests:
  - vars:
      context: "The document says X, Y, and Z."
      query: "Summarize the key points."
    assert:
      - type: llm-rubric
        value: |
          The response should:
          1. Mention X, Y, and Z from the provided context
          2. Not introduce information not present in the context
          3. Be written in 2-4 sentences
          4. Not start with "Certainly!" or similar filler
```
Custom validator (Python) for task-specific checks:
```python
# In your test suite
import json

from jsonschema import validate

def check_json_output(output: str, context: dict) -> dict:
    """Validate that structured output is valid JSON matching the expected schema."""
    expected_schema = {
        "type": "object",
        "required": ["category", "confidence", "reason"],
        "properties": {
            "category": {"type": "string", "enum": ["billing", "technical", "account"]},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
            "reason": {"type": "string"},
        }
    }
    try:
        parsed = json.loads(output)
        validate(parsed, expected_schema)
        return {"pass": True, "score": parsed["confidence"]}
    except Exception as e:
        return {"pass": False, "reason": str(e)}
```
LLM-as-judge — implementation details
Using an LLM to evaluate another LLM’s output is powerful but has documented biases:
- Position bias: the judge prefers the first answer when comparing two
- Verbosity bias: the judge prefers longer responses even when they’re less accurate
- Self-enhancement bias: GPT-4o ranks GPT-4o outputs higher; use a different family as judge
Calibrated judge prompt pattern:
```python
JUDGE_PROMPT = """You are an expert evaluator for a {task_type} system.

Evaluate the following response on a scale of 1-5 for {dimension}:
1 = Completely wrong or harmful
2 = Mostly wrong with minor correct elements
3 = Partially correct but missing key information
4 = Mostly correct with minor issues
5 = Completely correct and well-formed

Task: {task_description}
User question: {user_query}
{context_block}

Response to evaluate:
{response}

Provide your evaluation in this exact JSON format:
{{
  "score": <1-5>,
  "reasoning": "<one sentence explaining the score>",
  "key_issues": ["<issue 1>", "<issue 2>"]
}}

Do not consider response length in your score. Evaluate only accuracy and completeness."""
```
Calibration: Before using LLM-as-judge at scale, label 50–100 examples yourself and measure judge-human agreement (Cohen’s kappa). Target kappa > 0.6 (substantial agreement). If the judge disagrees with your labels on > 30% of cases, revise the prompt.
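A minimal kappa implementation for binary pass/fail labels (for multi-class labels, scikit-learn's `cohen_kappa_score` covers the same ground):

```python
def cohen_kappa(human: list[int], judge: list[int]) -> float:
    """Cohen's kappa for two binary label lists (1 = pass, 0 = fail)."""
    assert len(human) == len(judge) and human
    n = len(human)
    # Observed agreement: fraction of items where the raters match
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement from each rater's marginal pass rate
    p_h, p_j = sum(human) / n, sum(judge) / n
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters are constant and identical
    return (observed - expected) / (1 - expected)
```

Run it over your 50-100 hand-labeled examples against the judge's pass/fail verdicts; a result above 0.6 means the judge is usable at scale.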
RAG-specific evaluation with RAGAS
RAGAS evaluates the full RAG pipeline — not just the final answer but the retrieval quality.
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Build evaluation dataset
data = {
    "question": ["What is the maximum context length for Phi-4 Mini?"],
    "answer": ["Phi-4 Mini supports up to 128K context tokens."],           # model output
    "contexts": [["Phi-4 Mini has a 128K context window and 3.8B params"]], # retrieved chunks
    "ground_truth": ["Phi-4 Mini has a 128K token context window."],        # expected answer
}
dataset = Dataset.from_dict(data)

results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=your_azure_openai_client,  # judge model — use a stronger model than the one you're evaluating
)
```
| Metric | What a low score means | Fix |
|---|---|---|
| Faithfulness | Answer contains information not in retrieved chunks | Reduce hallucination: lower temperature, add “only use the provided context” instruction |
| Answer relevancy | Answer doesn’t address the question | Improve generation prompt instructions |
| Context precision | Retrieved chunks contain lots of irrelevant content | Improve retrieval: better embedding model, tighter query, stricter similarity threshold |
| Context recall | Retrieval missed chunks needed to answer | Improve retrieval: more chunks per query, smaller chunk size, re-ranking |
Run RAGAS in CI on every change to your chunking strategy, embedding model, or retrieval parameters. A 5% drop in context recall on your golden dataset is a merge blocker.
Tracking regressions over time
Store eval results with timestamps and compare model-by-model in Langfuse or a simple PostgreSQL table:
```sql
CREATE TABLE eval_runs (
    run_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    run_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    model VARCHAR(100),
    prompt_version VARCHAR(20),
    dataset_name VARCHAR(100),
    pass_rate FLOAT,
    avg_faithfulness FLOAT,
    avg_relevancy FLOAT,
    p95_latency_ms INT
);
```
Set a gate: block deployment if pass_rate < 0.95 OR avg_faithfulness < 0.80 OR p95_latency_ms > SLO.
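As a sketch, that gate reduces to one boolean check in the CI job; the 2,000 ms default here is an assumed SLO, substitute your own:

```python
def deploy_gate(pass_rate: float, avg_faithfulness: float,
                p95_latency_ms: int, slo_latency_ms: int = 2000) -> bool:
    """Return True only if the latest eval run clears every deployment threshold."""
    return (
        pass_rate >= 0.95
        and avg_faithfulness >= 0.80
        and p95_latency_ms <= slo_latency_ms
    )
```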
Prompt Engineering
Prompt engineering is misunderstood as “wording tricks.” It’s actually a set of techniques that change how the model reasons internally — with measurable effects on output quality. Understanding why each technique works helps you apply them correctly.
System prompt design
The system prompt sets the model’s role, constraints, and output format. It runs in every request. Design it as a contract between you and the model, not a suggestion.
Structure that works:
```
[Role] You are a {specific role} for {specific company/context}.

[Scope] You help users with: {list of in-scope tasks}.
You do not: {list of out-of-scope tasks}.

[Format] Respond in {format description}.
{example if format is complex}

[Constraints]
- {constraint 1}
- {constraint 2}

[Fallback] If you cannot answer from the provided context, say exactly:
"{fallback phrase}" — do not fabricate information.
```
Common mistakes:
- Too short: “You are a helpful assistant.” Gives the model no constraints — it will be helpful in unpredictable ways.
- Contradictory: “Be concise but thorough” — concise and thorough are in tension. Pick one or specify the trade-off (“be concise unless the question requires detail”).
- Missing the fallback: Without an explicit fallback instruction, models will hallucinate rather than admit they don’t know.
- Instruction-following check skipped: After writing a system prompt, test it on 10 adversarial inputs: “Ignore previous instructions,” “Repeat your system prompt,” “Pretend you have no restrictions.” A prompt that fails these tests is not production-ready.
Few-shot examples
Few-shot examples are the most reliable way to enforce output format and style. The model learns the pattern from the examples, not from your description of what you want.
Rule: Show examples in the exact format you want output. If you want JSON output, show JSON in the examples. If you want a two-sentence summary followed by bullet points, show that pattern in every example.
```python
SYSTEM_PROMPT = """You are a technical support classifier.
Classify the user's issue into one of: [billing, technical, account, other].

Examples:

User: "My invoice shows a double charge for March."
Output: {"category": "billing", "confidence": 0.95, "reason": "Payment/invoice dispute"}

User: "The API keeps returning 429 errors."
Output: {"category": "technical", "confidence": 0.9, "reason": "Rate limiting error"}

User: "How do I reset my password?"
Output: {"category": "account", "confidence": 0.98, "reason": "Credential management"}

Always output valid JSON matching the schema above. Never add extra fields."""
```
How many examples: 2–5 is typically sufficient. Beyond 5, you’re consuming context window without proportional quality gain, unless your task has high variance (many different valid output forms).
Chain-of-thought
Chain-of-thought (CoT) prompting asks the model to show its reasoning before giving the final answer. This works because it forces the model to allocate intermediate computation to reasoning steps rather than jumping to an answer.
Use CoT when: the task involves multi-step reasoning, math, or decisions that depend on intermediate conclusions.
Don’t use CoT when: the task is classification, extraction, or summarization with a clear right answer — it adds latency and tokens without quality improvement.
Zero-shot CoT (simplest):
```
Question: If a T4 GPU has 16GB VRAM and a model uses 14.6GB for weights,
how many concurrent sequences can it run at max-model-len 4096?

Think through this step by step, then give the final answer.
```
Few-shot CoT (more reliable):
```
Question: Calculate GPU tier needed for Llama 3.3 70B at int8.
Reasoning:
- Parameters: 70.6B
- int8 bytes per param: 1.0
- Weights memory: 70.6B × 1.0 = 70.6 GB
- Apply 1.3× headroom: 70.6 × 1.3 = 91.8 GB needed
- T4 (16GB): too small
- A100 80GB: 80 × 0.90 = 72 GB usable < 91.8 GB: too small
- 2× A100 80GB: 160 × 0.90 = 144 GB usable > 91.8 GB: sufficient
Answer: 2× A100 80GB (NC48ads_A100_v4)

Question: Calculate GPU tier needed for Phi-4 Mini at fp16.
Reasoning:
<model completes the pattern>
```
Structured output
For tasks that produce JSON, Markdown tables, or other structured formats, reliability matters. Three techniques, from lightest-weight to most involved:
Option 1 — JSON mode (vLLM / OpenAI API):
```python
response = client.chat.completions.create(
    model="phi-4-mini-instruct",
    messages=[...],
    response_format={"type": "json_object"},  # forces JSON output
)
```
JSON mode guarantees syntactically valid JSON but not schema compliance.
Option 2 — Grammar-constrained decoding (vLLM, most reliable):
```python
import json

from vllm import LLM, SamplingParams

schema = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "technical", "account"]},
        "confidence": {"type": "number"},
    },
    "required": ["category", "confidence"]
}

sampling_params = SamplingParams(
    guided_decoding={"json": json.dumps(schema)}  # tokens that violate the schema are masked
)
```
Grammar-constrained decoding modifies the token probability distribution at each step so only tokens that keep the output valid are sampled. 100% schema compliance, no retry logic needed.
Option 3 — Guardrails AI (post-processing validation):
```python
from guardrails import Guard
from guardrails.hub import ValidJSON, ValidChoices

guard = Guard().use_many(
    ValidJSON(),
    ValidChoices(choices=["billing", "technical", "account"], on_fail="reask"),
)

response = guard(
    openai_client.chat.completions.create,
    prompt="Classify this support ticket: ...",
    model="phi-4-mini-instruct",
    max_tokens=200,
)
```
Guardrails AI retries up to N times with an error message injected into the context, asking the model to fix its output.
Context window management
Long conversations degrade quality. As the context grows, models give less attention to the system prompt and early instructions. At 60–80% of the context window, instruction following typically degrades.
Three mitigation strategies:
Progressive summarization:
```python
def manage_context(messages: list, model_max_tokens: int, reserve_tokens: int = 1000) -> list:
    """Summarize old messages when context approaches the limit."""
    current_tokens = count_tokens(messages)
    limit = model_max_tokens - reserve_tokens  # reserve room for the completion

    if current_tokens < limit * 0.7:
        return messages  # no action needed

    # Keep system prompt + last 4 turns, summarize the rest
    system = [m for m in messages if m["role"] == "system"]
    recent = messages[-4:]
    to_summarize = [m for m in messages if m not in system and m not in recent]

    if not to_summarize:
        return messages

    summary = summarize_conversation(to_summarize)  # call LLM to summarize
    summary_msg = {"role": "assistant", "content": f"[Previous conversation summary: {summary}]"}
    return system + [summary_msg] + recent
```
Selective context injection (RAG conversations): Instead of accumulating the full conversation, re-retrieve context on each turn. The user’s latest message contains most of the retrieval signal needed — prior turns add diminishing value.
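One way to sketch this, with a hypothetical `retrieve()` callable: store the history without its injected context and rebuild each turn's messages fresh, so retrieved chunks never accumulate across turns:

```python
def build_turn_messages(system_prompt: str, history: list[dict],
                        user_msg: str, retrieve) -> list[dict]:
    """Re-retrieve context from the latest message only; history stays context-free."""
    chunks = retrieve(user_msg)  # the latest message carries most of the retrieval signal
    context = "\n\n".join(chunks)
    return (
        [{"role": "system", "content": system_prompt}]
        + history  # prior turns, stored WITHOUT their injected context
        + [{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_msg}"}]
    )
```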
Fixed-size sliding window: For multi-turn chat, keep only the last N turns. Simple and effective for most chatbot use cases. N=10 turns covers 95%+ of real conversations while keeping context manageable.
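A sliding-window sketch, assuming the usual alternating user/assistant message list with an optional system prompt at the front:

```python
def sliding_window(messages: list[dict], n_turns: int = 10) -> list[dict]:
    """Keep the system prompt plus the last n_turns user/assistant exchanges."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    return system + dialogue[-2 * n_turns:]  # one turn = a user + assistant pair
```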
RAG Patterns
RAG adds a retrieval step that makes the model’s answer dependent on your documents, not its training data. This is correct for domain-specific, frequently-changing, or private information. The tradeoff: quality is now bounded by both retrieval quality and generation quality.
Chunking — the upstream bottleneck
Bad chunking propagates through the entire pipeline. A missed fact at the chunking step cannot be recovered by better retrieval or a better model.
Fixed-size chunking with overlap:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # tokens, not characters
    chunk_overlap=50,  # ~10% overlap to avoid boundary splits
    length_function=lambda text: len(tokenizer.encode(text)),  # use the target model's tokenizer
)
```
Why 512 tokens: at this size, each chunk contains roughly one coherent topic. Larger chunks increase recall but decrease precision (more noise per retrieved chunk). Smaller chunks increase precision but miss context that spans multiple sentences.
Sentence-aware chunking (better for prose):
```python
from langchain.text_splitter import SpacyTextSplitter

# Respects sentence boundaries — never splits mid-sentence
splitter = SpacyTextSplitter(chunk_size=512)
```
Code-aware chunking (critical for technical docs):
```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Splits at function/class boundaries, not arbitrary character positions
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100,
)
```
For codebases, splitting at the function level (using the AST) outperforms fixed-size splitting by 15–25% on code retrieval tasks. Each function is a semantic unit — a fixed-size splitter cuts functions in half.
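A minimal sketch of function-level splitting for Python sources, using the standard-library `ast` module (the function name is my own):

```python
import ast

def split_python_by_unit(source: str) -> list[str]:
    """One chunk per top-level function or class, so each chunk is a semantic unit."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment recovers the exact source text of the node
            chunks.append(ast.get_source_segment(source, node))
    return chunks
```

A production version would also need to handle module-level code, docstrings, and other languages (tree-sitter is the usual choice for polyglot repos).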
Metadata enrichment at index time: Attach metadata to every chunk before storing it. This enables filtered retrieval later:
```python
chunks_with_metadata = [
    {
        "content": chunk.page_content,
        "metadata": {
            "source": document_path,
            "section": extract_section_heading(chunk),
            "doc_type": "technical_guide",
            "last_updated": document_date,
            "language": "en",
        }
    }
    for chunk in chunks
]
```
Retrieval strategies
Sparse + dense hybrid retrieval: Neither BM25 (keyword) nor vector (semantic) retrieval dominates across all query types. Sparse retrieval is better for exact term matching (product codes, error messages, proper nouns). Dense retrieval is better for semantic similarity (“how do I fix latency” ↔ “TTFT optimization”).
Combining them consistently outperforms either alone.
```python
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

def hybrid_retrieve(query: str, top_k: int = 20) -> list:
    """Combine BM25 and vector search, return top-k by reciprocal rank fusion."""
    query_embedding = embed(query)

    results = search_client.search(
        search_text=query,  # BM25 path
        vector_queries=[
            VectorizedQuery(
                vector=query_embedding,
                k_nearest_neighbors=top_k,
                fields="content_vector"  # dense path
            )
        ],
        query_type="semantic",  # rerank with semantic model
        semantic_configuration_name="inference-config",
        top=top_k,
        select=["content", "source", "section", "metadata"],
    )
    return list(results)
```
Azure AI Search handles the fusion and semantic re-ranking natively when query_type="semantic".
Filtered retrieval — scoped to relevant documents:
```python
results = search_client.search(
    search_text=query,
    filter="doc_type eq 'technical_guide' and last_updated ge 2026-01-01",
    top=10,
)
```
Filtering before retrieval is more efficient than filtering top-N results after retrieval. Set filters based on available metadata — document type, recency, access level, user context.
HyDE (Hypothetical Document Embedding): For queries that are short and abstract (“how does KEDA scaling work?”), the query embedding is often too sparse to retrieve the right chunks. HyDE generates a hypothetical answer first, embeds the answer rather than the query, and retrieves documents similar to the hypothetical answer.
```python
def hyde_retrieve(query: str, llm_client, top_k: int = 5) -> list:
    """Retrieve using a hypothetical answer embedding instead of the raw query."""
    # Generate a hypothetical ideal answer (doesn't need to be accurate)
    hyde_response = llm_client.chat.completions.create(
        model="phi-4-mini-instruct",
        messages=[{
            "role": "user",
            "content": f"Write a 3-sentence technical explanation that would answer: {query}"
        }],
        max_tokens=150,
    )
    hypothetical_answer = hyde_response.choices[0].message.content

    # Embed the hypothetical answer and retrieve
    embedding = embed(hypothetical_answer)
    return vector_search(embedding, top_k=top_k)
```
HyDE improves recall on abstract or paraphrased queries by 10–20% at the cost of one additional LLM call.
Re-ranking
Retrieval returns candidates. Re-ranking selects the best ones. A cross-encoder re-ranker reads the query and each document together and produces a relevance score — it’s slower than embedding similarity but significantly more accurate.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Score query-document pairs and return the top_n."""
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]
```
Retrieval strategy by use case:
| Use case | Strategy |
|---|---|
| Simple Q&A over structured docs | Dense-only, top-5, no re-rank (fast) |
| Technical support over mixed content | Hybrid (BM25 + dense), re-rank top-20 → top-5 |
| Legal/compliance document search | Hybrid + metadata filter + re-rank + citation |
| Multi-hop questions (answer requires >1 doc) | Iterative retrieval or graph-based RAG |
Multi-hop retrieval for complex questions
Some questions cannot be answered from a single chunk — the answer requires combining information across multiple documents. Standard single-shot retrieval fails here.
Iterative retrieval:
```python
def multi_hop_retrieve(question: str, max_hops: int = 3) -> list:
    all_contexts = []
    current_query = question

    for hop in range(max_hops):
        chunks = retrieve(current_query, top_k=3)
        all_contexts.extend(chunks)

        # Ask the model: do we have enough information? If not, what do we still need?
        reflection = llm_client.chat.completions.create(
            model="phi-4-mini-instruct",
            messages=[{
                "role": "user",
                "content": f"""Question: {question}

Retrieved so far:
{format_chunks(all_contexts)}

Can you fully answer the question with the above context?
If yes, respond: "COMPLETE"
If no, respond with the specific follow-up question needed to find the missing information."""
            }],
            max_tokens=100,
        ).choices[0].message.content

        if reflection.strip() == "COMPLETE" or hop == max_hops - 1:
            break
        current_query = reflection  # next hop uses the model's follow-up question

    return all_contexts
```
Fine-Tuning on AKS
Fine-tuning is often reached for too early. Before investing in it, try prompt engineering and RAG — they’re faster to iterate. Fine-tune when:
- Latency/cost reduction: you need GPT-4-level task quality from a T4-tier model. A 7B model fine-tuned on your specific task often outperforms a 70B general model on that task.
- Consistent structured output: the model needs to reliably produce a specific JSON schema or output format that prompt engineering can’t reliably enforce.
- Style and voice: the model needs to write in a specific brand voice or follow house style that’s difficult to describe in a prompt.
- Knowledge consolidation: you have proprietary data that changes infrequently and can be baked into the weights. Note: for frequently-changing data, RAG is almost always better.
Don’t fine-tune when:
- Your task success rate is below 70% on your eval set — the model doesn’t understand the task at all. More data won’t fix a fundamentally wrong model; fix your prompt first.
- You have fewer than 500 high-quality labeled examples. Fine-tuning on low-quality or insufficient data produces a model that confidently does the wrong thing.
- Your use case is adding new knowledge (facts, documents, product catalog). Models don’t reliably memorize facts through fine-tuning; they learn behavioral patterns. Use RAG.
LoRA and QLoRA — what you’re actually training
Full fine-tuning updates all weights — computationally prohibitive for 7B+ parameter models on single GPUs. LoRA (Low-Rank Adaptation) is a parameter-efficient technique that freezes the original weights and adds small trainable adapter matrices.
The math, briefly: instead of updating a weight matrix W (size d×d), LoRA adds two matrices A (d×r) and B (r×d) where r is the “rank” — typically 8–64. Total trainable parameters: 2 × d × r instead of d². At rank 16 for a 7B model, you train ~0.1% of the parameters while preserving most quality.
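The parameter arithmetic is easy to sanity-check. A minimal sketch — hidden size 4096 and 32 layers are assumed values for a typical 7B architecture, and `lora_params` is a hypothetical helper, not part of any library:

```python
# Back-of-envelope LoRA parameter count. Assumed values for a typical
# 7B architecture: hidden size d = 4096, 32 layers, adapters on
# q_proj and v_proj only.
def lora_params(d: int, r: int, n_layers: int, n_modules: int) -> int:
    # Each adapted matrix adds A (d x r) + B (r x d) = 2*d*r parameters
    return 2 * d * r * n_layers * n_modules

trainable = lora_params(d=4096, r=16, n_layers=32, n_modules=2)
fraction = trainable / 7_000_000_000
print(f"{trainable:,} trainable params ({fraction:.3%} of 7B)")
```

At rank 16 this lands around 8.4M trainable parameters — roughly 0.1% of the full model, matching the figure above.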
QLoRA runs LoRA on a 4-bit quantized base model, cutting the VRAM required for training by more than half. A Mistral 7B fine-tune that requires ~30GB with standard LoRA on an A100 runs in ~12GB with QLoRA — it fits comfortably on a single A100 80GB node.
Dataset preparation
The quality of your training data determines the ceiling of your fine-tuned model.
Format for instruction fine-tuning:
```json
{"messages": [
  {"role": "system", "content": "You are a customer support agent for AKS AI Lab."},
  {"role": "user", "content": "My GPU node didn't provision. What do I check?"},
  {"role": "assistant", "content": "Check these in order:\n1. Run `kubectl get nodeclaim -n karpenter` — look for a Failed NodeClaim\n2. Check AKS quota: `az vm list-usage -l eastus | grep NC`\n3. Check the Karpenter controller logs: `kubectl logs -n karpenter deployment/karpenter | grep ERROR`"}
]}
```
Minimum viable dataset sizes:
| Goal | Min examples | Notes |
|---|---|---|
| Format/style adaptation | 500 | Model already knows the domain; you’re shaping output style |
| Domain-specific knowledge | 2,000–5,000 | Model needs to learn new facts + format |
| Task specialization | 1,000–3,000 | High-quality examples matter more than quantity |
| Safety/refusal training | 500+ (+ neg. examples) | Include both positive and “this should be refused” pairs |
Quality checklist before training:
- Every example is correct — wrong examples actively degrade the model
- No duplicate or near-duplicate examples (deduplicate by semantic similarity)
- Coverage is balanced — check topic/length/complexity distribution
- No PII in training data
- Adversarial inputs have appropriate refusal responses
- Output format is consistent across all examples
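The deduplication item in the checklist can be sketched as a greedy filter over embeddings — `embed` here is a placeholder for whatever embedding function you already use, and the threshold is a starting point to tune:

```python
import numpy as np

# Sketch of near-duplicate removal by embedding similarity.
# `embed` is a placeholder for your embedding function; examples whose
# cosine similarity to an already-kept example exceeds `threshold`
# are dropped.
def dedup_examples(examples: list[str], embed, threshold: float = 0.95) -> list[str]:
    kept: list[str] = []
    kept_embs: list[np.ndarray] = []
    for ex in examples:
        emb = np.asarray(embed(ex), dtype=float)
        emb = emb / np.linalg.norm(emb)  # normalize so dot product = cosine
        if all(float(emb @ k) < threshold for k in kept_embs):
            kept.append(ex)
            kept_embs.append(emb)
    return kept
```

Greedy filtering is O(n²) in the worst case; for datasets beyond a few tens of thousands of examples, an approximate-nearest-neighbor index is the usual upgrade.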
Training with KAITO on AKS
KAITO supports QLoRA fine-tuning jobs via a Workspace CRD with inference: false and a tuning spec:
```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: finetune-mistral-7b
  namespace: inference
spec:
  resource:
    instanceType: "Standard_NC24ads_A100_v4"   # A100 80GB for training
    labelSelector:
      matchLabels:
        apps: mistral-7b-finetune
  tuning:
    preset:
      name: mistral-7b-v0.3
    method: qlora
    input:
      urls:
        - "https://<storage>.blob.core.windows.net/training/dataset.jsonl"  # Workload Identity auth
    output:
      image: "<your-acr>.azurecr.io/mistral-7b-support:v1"
      imagePushSecret: acr-push-secret
    config:
      # LoRA hyperparameters
      lora_rank: 16
      lora_alpha: 32
      lora_dropout: 0.05
      target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]
      # Training config
      num_epochs: 3
      per_device_train_batch_size: 4
      gradient_accumulation_steps: 4   # effective batch = 16
      learning_rate: 2.0e-4
      warmup_ratio: 0.03
      lr_scheduler_type: "cosine"
      # Memory optimization
      gradient_checkpointing: true
      bf16: true                       # A100 supports bfloat16
      max_seq_length: 2048
```
LoRA hyperparameter guidance:
- `lora_rank`: Start at 16. Increase to 32–64 if quality is poor; higher rank = more expressiveness but more parameters.
- `lora_alpha`: Set to 2×`lora_rank` as a starting point. Controls the magnitude of the LoRA update.
- `target_modules`: For most transformer models, `["q_proj", "v_proj"]` is the minimum. Adding `k_proj`, `o_proj`, and MLP layers (`gate_proj`, `up_proj`, `down_proj`) increases quality at the cost of more parameters.
- `learning_rate`: 1e-4 to 3e-4 for QLoRA. Higher than standard fine-tuning because you’re training fewer parameters.
- `num_epochs`: 2–5. Monitor validation loss — if it starts rising, stop early.
Evaluating the fine-tuned model
Never deploy a fine-tuned model based on training loss alone. Training loss measures fit to the training set, not generalization or task quality.
Evaluation pipeline:
```python
from statistics import mean

def evaluate_fine_tuned_model(
    base_model_client,
    finetuned_model_client,
    eval_dataset: list[dict],
) -> dict:
    """Run eval on both models, compare on quality and format compliance."""
    results = {"base": [], "finetuned": []}
    for example in eval_dataset:
        for name, client in [("base", base_model_client),
                             ("finetuned", finetuned_model_client)]:
            response = client.chat.completions.create(
                messages=example["messages"][:-1],  # exclude gold response
                max_tokens=512,
                temperature=0,
            )
            output = response.choices[0].message.content
            results[name].append({
                "output": output,
                "latency_ms": response.usage.completion_tokens * 30,  # rough estimate
                "format_valid": check_format(output, example["expected_format"]),
                "judge_score": llm_judge(example["messages"][-2]["content"], output),
            })
    return {
        "base_pass_rate": mean(r["format_valid"] for r in results["base"]),
        "finetuned_pass_rate": mean(r["format_valid"] for r in results["finetuned"]),
        "base_quality": mean(r["judge_score"] for r in results["base"]),
        "finetuned_quality": mean(r["judge_score"] for r in results["finetuned"]),
        "quality_delta": mean(r["judge_score"] for r in results["finetuned"])
                         - mean(r["judge_score"] for r in results["base"]),
    }
```
Promotion gate: deploy the fine-tuned model only if:
- `quality_delta > 0.10` (≥ 10% quality improvement)
- `finetuned_pass_rate > 0.95` (95% format compliance)
- p95 latency ≤ SLO (fine-tuning doesn’t change model size, but verify)
- No regression on held-out adversarial/safety examples
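The gate is worth encoding as an explicit function so CI can enforce it mechanically — a minimal sketch over the dict returned by the eval pipeline above (`should_promote` and the latency/safety arguments are illustrative names, not part of any framework):

```python
# Promotion gate as code: every condition must hold before the
# fine-tuned model ships. Thresholds mirror the gate in the text.
def should_promote(eval_results: dict, p95_latency_ms: float,
                   latency_slo_ms: float, safety_regressions: int) -> bool:
    return (
        eval_results["quality_delta"] > 0.10          # >= 10% quality improvement
        and eval_results["finetuned_pass_rate"] > 0.95  # 95% format compliance
        and p95_latency_ms <= latency_slo_ms            # latency SLO holds
        and safety_regressions == 0                     # no adversarial regressions
    )
```

Failing any one condition blocks the deploy; a single boolean keeps the pipeline logic trivial to audit.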
Cost Optimization
GPU inference is expensive. The three levers are: run the smallest adequate model, reduce token count, and avoid redundant computation.
The actual cost model
Cost per request splits into a prompt term and a completion term, each with its own effective rate:

```
Cost per request
  = prompt_tokens × (GPU $/hr) / (prompt_throughput tok/s × 3600)
  + completion_tokens × (GPU $/hr) / (generation_throughput tok/s × 3600)
```
Completion tokens cost significantly more than prompt tokens because generation is sequential (one token per forward pass), while prompts can be processed in parallel. On a T4 with Phi-4 Mini:
- Prompt processing: ~15,000 tokens/second (parallel)
- Generation: ~3,000 tokens/second (sequential)
A request with 500 prompt tokens + 300 completion tokens:
```
Prompt cost: 500 / 15,000 × $0.53/hr / 3,600 ≈ $0.0000049
Generation:  300 /  3,000 × $0.53/hr / 3,600 ≈ $0.0000147
Total:       ~$0.000020 per request
```
At 50,000 requests/day: $1.00/day in GPU time. The system node pool is $8.88/day. At this volume, infrastructure cost dominates.
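The arithmetic above as a reusable function, with the T4/Phi-4 Mini figures from this section as defaults (substitute your own GPU price and measured throughputs):

```python
# Cost model from the text: prompt tokens are processed in parallel,
# completion tokens are generated sequentially, so each has its own
# effective throughput. Defaults are the T4 / Phi-4 Mini figures above.
def cost_per_request(prompt_tokens: int, completion_tokens: int,
                     gpu_dollars_per_hr: float = 0.53,
                     prompt_tps: float = 15_000,
                     gen_tps: float = 3_000) -> float:
    dollars_per_sec = gpu_dollars_per_hr / 3600
    prompt_cost = prompt_tokens / prompt_tps * dollars_per_sec
    gen_cost = completion_tokens / gen_tps * dollars_per_sec
    return prompt_cost + gen_cost
```

With the defaults, `cost_per_request(500, 300)` reproduces the ~$0.000020 figure, and 50,000 such requests land near $1/day of GPU time.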
Prefix caching — the highest-impact optimization
If multiple requests share the same system prompt or conversation prefix, vLLM can reuse the KV cache for those tokens instead of recomputing them. This is called automatic prefix caching (APC).
Enable it for free:
```yaml
# In manifests/vllm/vllm-standalone.yaml
args:
  - --enable-prefix-caching
```
Impact: for a chatbot with a 500-token system prompt, every second-and-beyond turn in the conversation reuses those 500 tokens from cache. At 10 turns per session and 10,000 sessions/day, this eliminates 45M token computations per day — roughly 4× the GPU throughput for the same hardware.
Measuring cache effectiveness:
```
# Prefix cache hit rate — should be > 50% for chatbot use cases.
# Counter names vary across vLLM versions; check /metrics on the pod
# for the exact names exposed by your build.
rate(vllm:gpu_prefix_cache_hits_total{namespace="inference"}[5m])
  / rate(vllm:gpu_prefix_cache_queries_total{namespace="inference"}[5m])
```
Monitor the hit rate. If it’s below 20% for a chatbot use case, check that your system prompt is truly identical across requests (whitespace differences break cache matches).
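One cheap safeguard is to canonicalize the system prompt once at startup so every request sends a byte-identical prefix. A minimal sketch — note that collapsing whitespace changes the prompt’s layout, so only apply this if your prompt doesn’t rely on line breaks:

```python
# Whitespace differences break prefix-cache matches, so normalize the
# system prompt once at startup rather than rebuilding it per request.
def canonical_system_prompt(raw: str) -> str:
    # Collapse runs of whitespace and strip the ends so every request
    # sends an identical prefix.
    return " ".join(raw.split())

SYSTEM_PROMPT = canonical_system_prompt("""
    You are a customer support agent for AKS AI Lab.
    Answer concisely and cite the relevant runbook.
""")
```

The key property: two call sites that embed the same prompt with different indentation now produce the same cached prefix.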
Exact and semantic caching
Exact caching (Redis) — for repeated identical queries:
```python
import hashlib
import json

import redis

cache = redis.Redis(host="redis.inference.svc.cluster.local", port=6379)
CACHE_TTL = 3600  # 1 hour

def cached_inference(messages: list, model: str, **kwargs) -> str:
    cache_key = hashlib.sha256(
        json.dumps({"messages": messages, "model": model}).encode()
    ).hexdigest()
    if cached := cache.get(cache_key):
        return json.loads(cached)
    response = llm_client.chat.completions.create(
        messages=messages, model=model, **kwargs
    )
    result = response.choices[0].message.content
    cache.setex(cache_key, CACHE_TTL, json.dumps(result))
    return result
```
Best for: FAQ bots, documentation queries, classification tasks where users ask the same things repeatedly.
Semantic caching — for near-duplicate queries:
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.threshold = similarity_threshold
        self.cache: list[tuple[np.ndarray, str, str]] = []  # (embedding, query, response)

    def get(self, query: str) -> str | None:
        query_emb = np.array(embed(query))
        for stored_emb, stored_query, stored_response in self.cache:
            sim = cosine_similarity([query_emb], [stored_emb])[0][0]
            if sim >= self.threshold:
                return stored_response
        return None

    def set(self, query: str, response: str):
        self.cache.append((np.array(embed(query)), query, response))
```
Important caveat: semantic caching introduces latency for the embedding call (10–50ms). Only worthwhile if your inference latency is high (> 500ms) and your query repetition rate is high (> 30%). Measure before deploying.
Model cascade — route by task complexity
Not every request needs your most capable model. A model cascade routes simple requests to a cheap fast model and complex requests to a powerful model.
```python
ROUTING_PROMPT = """Classify this request's complexity:
- "simple": factual lookup, yes/no, short answer, format conversion
- "complex": multi-step reasoning, code generation, analysis, comparison

Request: {query}

Respond with only "simple" or "complex"."""

def cascade_route(query: str, context: str) -> str:
    # Use a tiny fast model to classify request complexity
    complexity = phi4_mini_client.chat.completions.create(
        messages=[{"role": "user", "content": ROUTING_PROMPT.format(query=query)}],
        max_tokens=5,
        temperature=0,
    ).choices[0].message.content.strip().lower()

    if complexity == "simple":
        client = phi4_mini_client   # T4, ~$0.53/hr
    else:
        client = llama70b_client    # 2× A100, ~$7.34/hr

    return client.chat.completions.create(
        messages=[{"role": "system", "content": context},
                  {"role": "user", "content": query}],
        max_tokens=512,
    ).choices[0].message.content
```
The classifier call (Phi-4 Mini) costs ~$0.000002. If 70% of queries are “simple” and the complex model costs 30× more, the cascade saves ~50% on inference cost with negligible quality degradation on the simple tier.
Prompt compression
Long prompts are expensive to process and take VRAM for KV cache. For RAG use cases where you’re injecting large retrieval contexts, consider compressing the context before sending it to the model.
LLMLingua strips tokens from the prompt that don’t contribute to the answer while preserving the information needed to answer the query:
```python
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)

def compress_context(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)
    result = compressor.compress_prompt(
        context,
        question=question,
        target_token=512,  # compress to 512 tokens regardless of input size
        condition_in_question="after_condition",
        rank_method="llmlingua",
    )
    return result["compressed_prompt"]
```
LLMLingua achieves 3–5× compression with < 5% quality degradation for most RAG tasks. At 2,000 tokens of retrieved context compressed to 512 tokens, you’ve reduced KV cache and TTFT by 75%.
CI/CD for LLM Changes
Three things change in an LLM system, and each requires a different pipeline:
| Change type | Risk | Pipeline |
|---|---|---|
| Prompt update | Medium — subtle quality regressions, behavior drift | Eval → review → canary |
| Model version upgrade | High — full behavior change, capability regression possible | Full benchmark → blue-green |
| RAG pipeline change | Medium-high — retrieval quality change silently degrades answers | RAGAS eval → traffic sample comparison |
Prompt CI/CD pipeline
```yaml
# .github/workflows/prompt-eval.yml
name: Prompt Eval
on:
  pull_request:
    paths:
      - "prompts/**"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Promptfoo evals
        run: npx promptfoo eval --config promptfooconfig.yaml --output results.json
        env:
          AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
          VLLM_ENDPOINT: ${{ secrets.VLLM_ENDPOINT }}
      - name: Parse results and gate
        run: |
          PASS_RATE=$(jq '.results.stats.successes / .results.stats.total' results.json)
          AVG_SCORE=$(jq '.results.stats.assertPassCount / .results.stats.assertCount' results.json)
          echo "Pass rate: $PASS_RATE"
          echo "Avg score: $AVG_SCORE"
          # Gate: 95% pass rate
          if (( $(echo "$PASS_RATE < 0.95" | bc -l) )); then
            echo "::error::Pass rate $PASS_RATE below 0.95 threshold"
            exit 1
          fi
      - name: Comment results on PR
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./results.json');
            const body = `## Eval Results\n\n` +
              `Pass rate: ${(results.results.stats.successes / results.results.stats.total * 100).toFixed(1)}%\n` +
              `Failed cases: ${results.results.stats.failures}\n\n` +
              `[Full results artifact](${process.env.GITHUB_SERVER_URL}/${process.env.GITHUB_REPOSITORY}/actions/runs/${process.env.GITHUB_RUN_ID})`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body,
            });
```
Canary rollout with per-version metrics:
```xml
<!-- APIM inbound: A/B split with version tracking -->
<set-variable name="promptVersion" value="@(new Random().Next(100) < 10 ? "v2" : "v1")" />
<choose>
  <when condition="@(context.Variables.GetValueOrDefault<string>("promptVersion") == "v2")">
    <!-- Route to backend that loads v2 prompt -->
    <set-header name="X-Prompt-Version" exists-action="override">
      <value>v2</value>
    </set-header>
  </when>
</choose>
<!-- Always emit version dimension for comparison in Langfuse / App Insights -->
<set-header name="X-Prompt-Version-Actual" exists-action="override">
  <value>@(context.Variables.GetValueOrDefault<string>("promptVersion"))</value>
</set-header>
```
Track quality_score by prompt_version in Langfuse for at least 200 samples before declaring v2 the winner and rolling to 100%.
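Whether 200 samples is enough depends on the size of the quality delta. A rough two-proportion z-test sketch (plain statistics, not a Langfuse API) shows why small deltas need more data — at 200 samples per arm the pass-rate estimate only resolves to roughly ±7% at 95% confidence:

```python
import math

# Two-proportion z-test on pass counts per prompt version.
# Returns True when the observed difference is unlikely to be noise.
def canary_significant(pass_v1: int, n_v1: int, pass_v2: int, n_v2: int,
                       z_crit: float = 1.96) -> bool:
    p1, p2 = pass_v1 / n_v1, pass_v2 / n_v2
    p = (pass_v1 + pass_v2) / (n_v1 + n_v2)  # pooled pass rate
    se = math.sqrt(p * (1 - p) * (1 / n_v1 + 1 / n_v2))
    return abs(p1 - p2) / se > z_crit if se > 0 else False
```

A 75% → 90% jump clears the bar at 200 samples per arm; an 85% → 87.5% improvement does not, and needs a larger sample before declaring a winner.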
Model upgrade pipeline
Model upgrades carry more risk than prompt changes — every behavior can shift.
1. Update model reference in staging deployment
2. Run full eval suite against staging (all golden datasets)
3. Run adversarial test suite (jailbreaks, injection attempts, refusal cases)
4. Run latency benchmark (TTFT, TPOT, throughput at target concurrency)
5. Human review of 20 randomly sampled outputs on complex cases
6. If all gates pass → blue-green deploy to production
7. Monitor for 48 hours at 5% traffic before full cutover
8. Rollback trigger: any SLO breach or quality drop > 5% in online eval
Blue-green for model upgrades — keep the old model running during transition:
```yaml
# Two deployments, one service
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-phi4-v1
  namespace: inference
spec:
  selector:
    matchLabels:
      app: vllm
      version: phi4-v1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-phi4-v2
  namespace: inference
spec:
  selector:
    matchLabels:
      app: vllm
      version: phi4-v2
---
# HTTPRoute: route 5% to v2
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  rules:
    - backendRefs:
        - name: vllm-phi4-v1-svc
          weight: 95
        - name: vllm-phi4-v2-svc
          weight: 5
```
RAG pipeline changes
Changing chunking strategy or embedding model requires re-indexing the entire corpus — a batch job, not a rolling deployment. Track which index version is active:
```python
INDEX_VERSION = "v3-chunk512-bge-m3"  # bump this on any pipeline change

def index_document(doc: str, source: str):
    chunks = splitter.split_text(doc)
    embeddings = embed_batch(chunks)
    search_client.upload_documents([
        {
            "id": f"{source}-{i}-{INDEX_VERSION}",
            "content": chunk,
            "content_vector": emb,
            "index_version": INDEX_VERSION,
            "source": source,
        }
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
    ])
```
Run RAGAS against both the old and new index before switching the production pointer. A 5% drop in context recall is a rollback signal.
Production Failure Modes
The cold-start problem
When KEDA scales from 0 replicas to 1 and NAP provisions a new GPU node, there is a 3–8 minute gap before the first request can be served:
```
KEDA scale trigger fires
  ↓ ~10s
New pod scheduled, NAP provisions node
  ↓ ~2–4 min
Node joins cluster, GPU driver initializes
  ↓ ~1–2 min
Pod starts, model weights loaded into VRAM
  ↓ ~1–2 min (Phi-4 Mini) to ~8 min (Llama 70B)
First request served
```
Mitigation strategies:
- Keep `minReplicaCount: 1` — one warm pod avoids the full cold start. The pod costs GPU time even at idle, but eliminates the provision delay.
- Use predictive scaling — if your traffic has daily patterns (business-hours peak), pre-scale 15 minutes before expected demand.
- Implement a queue buffer — when all pods are busy and no warm pod exists, return HTTP 202 with a queue position rather than timing out. The client polls for completion.
```yaml
# KEDA ScaledObject: scale-to-zero for large cost savings.
# Use only for workloads where the cold start is tolerable (batch jobs,
# async APIs); set minReplicaCount: 1 to keep a warm pod for interactive traffic.
minReplicaCount: 0
scaleTargetRef:
  apiVersion: apps/v1
  kind: Deployment
  name: vllm-standalone
triggers:
  - type: prometheus
    metadata:
      query: avg(vllm:num_requests_waiting{namespace="inference"})
      threshold: "1"
      activationThreshold: "1"
```
KV cache exhaustion under load
When vllm:gpu_cache_usage_perc approaches 100%, the scheduler starts preempting (evicting) in-progress sequences to make room for new ones. Preempted sequences must restart their prefill from scratch — this causes sudden TTFT spikes under high load.
Symptoms: TTFT spikes 3–10× at load that’s well within your max-num-seqs limit, with gpu_cache_usage_perc at 95%+.
Diagnosis:
```bash
# Watch KV cache usage in real time
kubectl exec -n inference <vllm-pod> -- curl -s http://localhost:8000/metrics \
  | grep "gpu_cache_usage"
```
Fix (in order):
- Reduce `max-num-seqs` — you’re scheduling more concurrent sequences than the KV cache can hold
- Enable `--kv-cache-dtype fp8` — halves KV cache memory on A100/H100
- Reduce `max-model-len` — each sequence reserves less KV space
- Add replicas and reduce `max-num-seqs` per pod proportionally
Conversation quality degradation at depth
Multi-turn conversations degrade because the model gives less attention weight to the system prompt as the context fills up. This is a known limitation of the attention mechanism, not a bug.
Signals:
- User satisfaction scores drop after conversation turn 5–7
- The model starts ignoring format instructions it followed in early turns
- Langfuse traces show identical queries producing different quality scores based on conversation depth
Monitor it:
```python
# In Langfuse, tag traces with turn number
langfuse_context.update_current_trace(
    metadata={
        "conversation_turn": turn_number,
        "context_tokens": current_context_tokens,
    }
)
```
Query: avg(quality_score) group by conversation_turn — a drop after turn 5 confirms the pattern.
Fix: implement progressive summarization (Section 3.5) or a fixed sliding window over conversation history.
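A minimal sketch of the sliding-window option — `count_tokens` stands in for your tokenizer, and the system prompt is always retained so late turns keep their instructions:

```python
# Keep the system prompt, then only the most recent turns that fit in
# a token budget. Older turns are dropped from the context entirely.
def sliding_window(messages: list[dict], count_tokens, budget: int) -> list[dict]:
    system, turns = messages[0], messages[1:]
    kept: list[dict] = []
    used = count_tokens(system["content"])
    for msg in reversed(turns):  # walk newest-first
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))
```

A hard cutoff can drop context the user still expects the model to remember; progressive summarization trades that risk for an extra summarization call.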
Streaming connection accumulation
Streaming inference responses (Server-Sent Events) hold an open HTTP connection for the duration of the completion. A client that opens a streaming connection and never closes it holds a concurrency slot for the full requestTimeout.
Symptom: vllm:num_requests_running stays high even as traffic drops. GPU utilization is low but the scheduler reports maximum concurrency. New requests queue even though the GPU is largely idle.
Fix:
- Set `requestTimeout` on the Envoy Gateway BackendTrafficPolicy (see ingress-guide.md)
- Set a read timeout on the streaming client in your application:
```python
import httpx

response = client.chat.completions.create(
    messages=...,
    stream=True,
    timeout=httpx.Timeout(connect=5, read=120, write=10, pool=5),
)
```
- Add a heartbeat check — if a streaming connection has produced no tokens in 30 seconds, close it from the client side.
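The heartbeat check needs a watchdog thread, because a consumer blocked on a stalled iterator can’t time itself out. A sketch under the assumption that `close` aborts the underlying HTTP response (for example `response.close()` on an httpx stream):

```python
import threading
import time

# Client-side watchdog: a timer thread closes the connection if no token
# has arrived for `stall_seconds`, which unblocks the consumer with a
# read error instead of holding a concurrency slot indefinitely.
def stream_with_watchdog(chunks, close, stall_seconds: float = 30.0):
    last = [time.monotonic()]      # shared mutable cell for the last-token time
    done = threading.Event()

    def watchdog():
        while not done.wait(1.0):  # poll once per second until the stream ends
            if time.monotonic() - last[0] > stall_seconds:
                close()
                return

    threading.Thread(target=watchdog, daemon=True).start()
    try:
        for chunk in chunks:
            last[0] = time.monotonic()
            yield chunk
    finally:
        done.set()  # stop the watchdog when the stream finishes or errors
```

Wrapping the SSE iterator this way pairs with the server-side `requestTimeout`: either end can release the slot, whichever notices the stall first.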
Feedback and Continuous Improvement
Production traffic is your most valuable training signal. Every user interaction is a labeled example if you instrument it correctly.
Signal collection
Explicit feedback — integrate directly into your UI:
```python
# When user rates a response
langfuse_client.score(
    trace_id=trace_id,
    name="user_rating",
    value=rating,  # 1–5 or 0/1 thumbs
    comment=user_comment,
)
```
Implicit feedback — infer quality from behavior:
| Behavior | Quality signal | How to measure |
|---|---|---|
| User re-prompts immediately | Bad answer | Time between response and next user message < 10s |
| Session abandonment after response | Bad answer | Session ends within 30s of a response |
| User copies response | Good answer | Clipboard event or UI copy button click |
| Escalation to human agent | Failed answer | Route change event |
| User continues conversation | Neutral to good | Any follow-up message |
```python
# Log implicit signals as Langfuse scores
def on_user_reprompt(trace_id: str, seconds_since_response: float):
    if seconds_since_response < 10:
        langfuse_client.score(
            trace_id=trace_id,
            name="implicit_quality",
            value=0,
            comment=f"Re-prompted after {seconds_since_response:.1f}s",
        )
```
Labeling infrastructure
Raw feedback signals need human review before entering a training dataset. Deploy Argilla on AKS for annotation workflows:
```bash
helm repo add argilla https://argilla-io.github.io/argilla
helm upgrade --install argilla argilla/argilla \
  --namespace argilla --create-namespace \
  --set replicaCount=1 \
  --set resources.requests.memory="1Gi"
```
Labeling pipeline:
```python
import json
from datetime import datetime

import argilla as rg

# Initialize Argilla connection
rg.init(api_url="http://argilla.argilla.svc.cluster.local:6900", api_key=...)

# Push low-rated traces to Argilla for review
def export_low_quality_traces(min_date: datetime, max_traces: int = 200):
    low_quality = langfuse_client.fetch_traces(
        tags=["production"],
        min_score={"name": "user_rating", "op": "lt", "value": 3},
        limit=max_traces,
    )
    records = [
        rg.TextClassificationRecord(
            text=trace.input["messages"][-1]["content"],
            prediction=[("bad_answer", 1.0)],
            annotation=None,  # labeler will fill this in
            metadata={
                "trace_id": trace.id,
                "model": trace.metadata.get("model"),
                "full_context": json.dumps(trace.input),
                "response": trace.output,
            },
        )
        for trace in low_quality.data
    ]
    rg.log(records, name="low_quality_traces", workspace="production-review")
```
What annotators should do: for each flagged trace, determine whether the issue is (a) wrong answer → add to fine-tuning dataset with a corrected response, (b) format violation → update prompt, (c) missing context → improve RAG retrieval, or (d) legitimate limitation → not fixable without model upgrade.
The improvement decision tree
When evaluations show a quality gap, the fix depends on the failure mode:
```
Quality gap observed
│
├─ Wrong format / style?
│   → Fix the prompt (system prompt + examples)
│   → If persistent after prompt fix → fine-tune on format
│
├─ Factually wrong on domain knowledge?
│   ├─ Knowledge available in documents?
│   │   → RAG: add to index, improve chunking/retrieval
│   └─ Knowledge not in documents?
│       → Fine-tune if static knowledge; accept limitation if dynamic
│
├─ Wrong answer on complex reasoning?
│   ├─ Passes on larger model (GPT-4o / Llama 70B)?
│   │   → Either use larger model OR fine-tune smaller model on CoT examples
│   └─ Fails on all models?
│       → Task may be inherently ambiguous; clarify requirements
│
├─ Inconsistent behavior (same query, different answers)?
│   → Lower temperature; use CoT; add few-shot examples; fine-tune
│
└─ Safety/refusal failures?
    → Add to adversarial test suite; fix system prompt; fine-tune on refusals
```
Detecting model drift
Unlike traditional ML models, a self-hosted LLM doesn’t drift from input distribution shift — the weights are frozen and were never trained on your data. Its behavior changes when:
- The model is upgraded (vLLM image tag changes) — use a pinned model version
- The underlying base model checkpoint is updated on HuggingFace — pin to a specific revision
- Your prompt changes affect capabilities you didn’t test
Run your golden dataset eval weekly on production traffic sampling, not just at deploy time:
```yaml
# .github/workflows/weekly-drift-check.yml
on:
  schedule:
    - cron: '0 6 * * 1'  # every Monday at 6am UTC
jobs:
  drift-check:
    steps:
      - name: Run eval against production endpoint
        run: |
          npx promptfoo eval \
            --config promptfooconfig.yaml \
            --providers vllm:http://inference-prod.yourdomain.com/v1 \
            --output weekly-results.json
      - name: Compare to baseline
        run: |
          python scripts/compare_eval_results.py \
            --baseline eval-baselines/latest.json \
            --current weekly-results.json \
            --threshold 0.05  # alert if pass rate drops > 5%
```
PTU vs. Consumption for the Fallback Path
When your architecture uses Azure OpenAI as a fallback (vLLM primary → Azure OpenAI on overload), the billing model for the fallback affects your cost floor.
Consumption — pay per token, shared capacity, subject to throttling. The right choice when traffic is unpredictable or low.
PTU — reserved capacity, guaranteed throughput, billed per hour. The right choice when you can predict traffic volume and that volume justifies the reservation.
Break-even: PTU is cheaper once you exceed ~60–70% utilization of the provisioned throughput. Below that, consumption is cheaper. Use the Azure OpenAI PTU calculator with your measured TPM from production.
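The break-even point falls out of a one-line comparison between the fixed reservation cost and the pay-per-token cost of the same throughput. All numbers below are illustrative placeholders — plug in the figures from the Azure pricing page and your measured TPM:

```python
# At what utilization does a PTU reservation beat pay-per-token?
# Every price here is a placeholder; substitute real Azure pricing.
def breakeven_utilization(ptu_dollars_per_hr: float,
                          ptu_capacity_tpm: float,
                          consumption_dollars_per_1k_tokens: float) -> float:
    # Tokens/hr the reservation could serve at 100% utilization
    max_tokens_per_hr = ptu_capacity_tpm * 60
    # What those tokens would cost on the consumption meter
    consumption_cost_at_full = (max_tokens_per_hr / 1000
                                * consumption_dollars_per_1k_tokens)
    # Utilization at which consumption spend equals the fixed PTU cost
    return ptu_dollars_per_hr / consumption_cost_at_full
```

With placeholder inputs of $10/hr reservation, 100k TPM capacity, and $0.0025 per 1k tokens, the function returns ~0.67 — consistent with the ~60–70% rule of thumb above.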
Hybrid strategy: PTU as primary guaranteed capacity + consumption as overflow:
```xml
<!-- APIM: PTU primary, consumption overflow on 429 -->
<retry condition="@(context.Response.StatusCode == 429)" count="1" interval="0">
  <set-backend-service base-url="{{aoai-consumption-endpoint}}" />
  <set-header name="api-key" exists-action="override">
    <value>{{aoai-consumption-key}}</value>
  </set-header>
</retry>
```
The interval="0" is intentional for a backend switch (no wait needed — you’re routing to a different endpoint, not retrying the same one). Do not set interval > 0 for the consumption fallback — you want immediate rerouting, not a backoff.
Recommended Stack
This stack is opinionated for the AKS-on-Azure context of this lab. Every component has a clear scope; no two overlap.
| Layer | Tool | Hosting | Scope |
|---|---|---|---|
| Edge / WAF | Azure Front Door Premium | Managed | DDoS, WAF, global routing |
| API gateway | Azure API Management | Managed | Auth, token rate limiting, cost attribution, fallback routing |
| Inference engine | vLLM | AKS GPU pool | High-throughput serving, prefix caching, continuous batching |
| Autoscaling | KEDA + NAP | AKS | Request-driven scale, GPU node lifecycle |
| Application tracing | Langfuse | AKS (self-hosted) | Per-request traces, quality scores, cost attribution by user |
| System metrics | Azure Managed Prometheus + Grafana | Managed | vLLM metrics, GPU utilization, KEDA queue depth |
| Evals (offline) | Promptfoo | CI (GitHub Actions) | Pre-deploy quality gate on prompt/model changes |
| Evals (online) | Langfuse scores + LLM-as-judge | AKS / CI | Continuous quality monitoring in production |
| RAG retrieval | Azure AI Search | Managed | Hybrid search (BM25 + vector), semantic ranking |
| Guardrails | Azure AI Content Safety | Managed | Prompt Shield (input) + harm detection (output) via APIM |
| Labeling | Argilla | AKS | Annotation workflow for fine-tuning datasets |
| Fine-tuning | KAITO QLoRA jobs | AKS GPU pool | LoRA/QLoRA training on A100 nodes |
| Model registry | Azure Container Registry | Managed | Fine-tuned model images, digest-pinned |
References
- vLLM metrics documentation
- RAGAS evaluation framework
- Langfuse self-hosting
- Promptfoo documentation
- KAITO fine-tuning documentation
- LLMLingua prompt compression
- Argilla annotation platform
- HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels
- LoRA: Low-Rank Adaptation of Large Language Models
- QLoRA: Efficient Finetuning of Quantized LLMs
- RAGAS: Automated Evaluation of Retrieval Augmented Generation
- Azure OpenAI PTU calculator
- Azure AI Content Safety
- LiteLLM Proxy