RAG on Azure: Self-Hosted vs Managed Stack

What is RAG? The Problem RAG Solves: Large Language Models (LLMs) learn from a massive but frozen snapshot of the world. Once training ends, the model’s knowledge is sealed. It cannot read your internal documentation, does not know what changed last quarter, and has never seen your proprietary data. The result: when you ask an LLM about anything outside its training data, it fabricates a plausible-sounding answer. This is called hallucination, and it is not a bug that will be fixed; it is a fundamental property of how language models work. Three strategies exist to close this gap: Strategy How it works … Continue reading RAG on Azure: Self-Hosted vs Managed Stack
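The retrieve-then-augment loop the excerpt describes can be sketched in a few lines. The toy corpus, bag-of-words "embedding", and helper names (`embed`, `retrieve`, `build_prompt`) are illustrative stand-ins for a real embedding model and vector store, not the stack the post uses:

```python
from math import sqrt

# Toy corpus standing in for internal docs the model has never seen
# (hypothetical snippets, for illustration only).
DOCS = {
    "vpn":    "Q3 update: the VPN endpoint moved to vpn2.example.internal.",
    "oncall": "On-call rotation is weekly, handover on Mondays at 09:00.",
}

def embed(text: str, vocab: list[str]) -> list[float]:
    """Stand-in 'embedding': bag-of-words vector over a shared vocabulary."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str) -> str:
    """Return the document most similar to the query."""
    vocab = sorted({w for t in list(DOCS.values()) + [query] for w in t.lower().split()})
    qv = embed(query, vocab)
    return max(DOCS.values(), key=lambda d: cosine(embed(d, vocab), qv))

def build_prompt(query: str) -> str:
    # Grounding: the retrieved snippet is injected ahead of the question,
    # so the LLM answers from current data instead of its frozen snapshot.
    return f"Context:\n{retrieve(query)}\n\nQuestion: {query}"

print(build_prompt("where is the vpn endpoint"))
```

In a production stack the `embed` step is an embedding model and `DOCS` is a vector index, but the shape of the loop (embed query, rank by similarity, prepend the winner to the prompt) is the same.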

Securing Applications That Rely on Inference Servers

Inference servers introduce a threat model that differs from standard web APIs, and the differences matter. This guide covers the controls needed at each layer: edge, API management, in-cluster networking, identity, observability, and supply chain. This blog post uses https://rtrentinsworld.com/2026/03/27/running-llm-inference-on-aks/ as a reference. Edge Protection: WAF and DDoS. The first line of defense for any publicly reachable inference endpoint is a Web Application Firewall running in Prevention mode, not Detection mode. Detection mode logs attacks but passes them through: every prompt injection payload, malformed JSON body, and RCE attempt in HTTP headers reaches your APIM and potentially your GPU pods. Switching to Prevention blocks them … Continue reading Securing Applications That Rely on Inference Servers
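The Prevention-vs-Detection distinction can be sketched as a toy request filter. The `Waf` class and the substring "rule" below are hypothetical illustrations of the behavioral difference, not the Azure WAF API or a real managed rule set (which evaluates OWASP CRS rules, not a string match):

```python
from dataclasses import dataclass, field

@dataclass
class Waf:
    mode: str                      # "Detection" or "Prevention"
    log: list[str] = field(default_factory=list)

    def handle(self, request_body: str) -> str:
        malicious = "<script>" in request_body   # toy rule for illustration
        if malicious:
            self.log.append(f"match: {request_body!r}")
            if self.mode == "Prevention":
                return "403 blocked at the edge"  # never reaches APIM or GPU pods
        return "200 forwarded to backend"         # Detection mode only logs

payload = '{"prompt": "<script>alert(1)</script>"}'
print(Waf(mode="Detection").handle(payload))   # logged, but still forwarded
print(Waf(mode="Prevention").handle(payload))  # logged and blocked
```

Both modes produce the same log entry; only Prevention changes what the backend sees, which is the point the excerpt makes.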

Self-Hosted LLMOps

When you call Azure OpenAI or the OpenAI API, most of the operational surface disappears: Microsoft or OpenAI manages the GPU, the model weights, the inference runtime, and the content filters. Your ops surface is prompts, evals, and cost control. Self-hosted LLMOps is what remains when you take all of that back. You own the GPU lifecycle, the model serving process, the scaling logic, the guardrails, the observability pipeline, and the feedback loop that improves quality over time. The tradeoffs that make self-hosting worth it — data sovereignty, cost at volume, no vendor lock-in, full control over serving parameters — … Continue reading Self-Hosted LLMOps

Running LLM Inference on AKS

Most teams running LLMs start with a cloud API. At some point — whether driven by cost, compliance, or latency — the question becomes: should we self-host? And if we do, should we run on a VM or on Kubernetes? This post answers those questions with specifics. It covers when AKS + GPU inference makes sense, how to choose the right model for your use case, and how to size every layer of the stack: GPU node, pod configuration, and replica count. When Does GPU Inference on AKS Make Sense? Option A: Cloud API (Azure OpenAI) No infrastructure. Pay per token. … Continue reading Running LLM Inference on AKS
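The "size every layer" question starts with GPU memory: weights plus KV cache. A rough sizing sketch, assuming FP16/BF16 weights and a dense-attention KV cache; the 7B-parameter, 32-layer, 4096-hidden-dim figures are illustrative assumptions, not a recommendation from the post:

```python
def weights_gib(params_b: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights, e.g. 2 bytes/param for FP16/BF16."""
    return params_b * 1e9 * bytes_per_param / 2**30

def kv_cache_gib(layers: int, hidden: int, context: int,
                 batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) * layers * hidden * tokens * batch * bytes."""
    return 2 * layers * hidden * context * batch * bytes_per_elem / 2**30

# Hypothetical 7B model, 4k context, 8 concurrent sequences.
w = weights_gib(7)                                   # ~13 GiB in FP16
kv = kv_cache_gib(layers=32, hidden=4096, context=4096, batch=8)
print(f"weights ~{w:.1f} GiB, KV cache ~{kv:.1f} GiB, total ~{w + kv:.1f} GiB")
```

Numbers like these drive the GPU-node choice: a total near 29 GiB rules out a 24 GiB card for this configuration before any runtime overhead is counted.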

Network diagram — multi-cloud topology with AWS and Azure transits

I Got Tired of Writing Design Documents, So I Built a Tool That Does It for Me

If you’ve ever had to write a Design Document from scratch — you know the pain. You’re staring at dozens of Terraform files, cross-referencing module parameters, tracing spoke-to-transit attachments, figuring out which firewall image string maps to which vendor and … Continue reading I Got Tired of Writing Design Documents, So I Built a Tool That Does It for Me

Solving PAN-OS Routing Issues with Enforce-Symmetric-Return

Overview: Inbound internet traffic to workloads in Aviatrix spoke VPCs is routed through PAN-OS firewalls for inspection using a Global External Application Load Balancer with Zonal NEGs. A Policy Based Forwarding (PBF) rule with enforce-symmetric-return on PAN-OS handles the asymmetric routing caused by the GFE proxy sourcing all traffic from 35.191.0.0/16. Architecture. Why PBF with Enforce-Symmetric-Return: The Global Application LB is a reverse proxy, so all backend traffic (health checks and real user requests) arrives from Google Front End IPs in the 35.191.0.0/16 range. This creates an asymmetric routing problem. Why dual VRs don’t solve this: PAN-OS sessions are not … Continue reading Solving PAN-OS Routing Issues with Enforce-Symmetric-Return

Meet Pyr Reader: An AI-Powered Content Hub Built with Rust and Tauri

I built a desktop app to solve a problem I kept running into: information overload. Between RSS feeds, email newsletters, and social media, I was drowning in content with no good way to organize, prioritize, or actually learn from it. Pyr Reader is my answer — a native macOS app that pulls content from multiple sources, classifies it with AI, and helps me focus on what actually matters. Named after Carlos Alberto, my Great Pyrenees — a loyal, watchful companion. Pyr Reader watches over your information feeds so you don’t have to. The Problem Every morning I’d open a dozen tabs: RSS reader, … Continue reading Meet Pyr Reader: An AI-Powered Content Hub Built with Rust and Tauri

Carlos, The Cloud Architect

Overview: Carlos the Architect implements a multi-agent Software Development Lifecycle (SDLC) for cloud infrastructure design. The system uses 11 specialized AI agents orchestrated through LangGraph to automate the complete journey from requirements gathering to production-ready Terraform code, with historical learning from past deployment feedback. SDLC Phases Mapped to Agents:

SDLC Phase       | Agent(s)                       | Output                    | Purpose
1. Requirements  | Requirements Gathering         | Clarifying questions      | Understand user needs
2. Learning      | Historical Learning            | Context from past designs | Learn from deployment feedback
3. Design        | Carlos + Ronei (parallel)      | 2 architecture designs    | Competitive design generation
4. Analysis      | Security, Cost, SRE (parallel) | 3 specialist reports      | Multi-dimensional review
5. Review        | Chief Auditor                  | Approval decision         | …

Continue reading Carlos, The Cloud Architect
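The phase-to-agent pipeline described above could be orchestrated roughly like this plain-Python sketch. The real system uses LangGraph; the three stub agents, their outputs, and the state keys here are hypothetical simplifications for illustration:

```python
from typing import Callable

State = dict[str, object]
Agent = Callable[[State], State]

def requirements(state: State) -> State:
    """Phase 1 stub: emit clarifying questions for the user."""
    return {**state, "questions": ["What region?", "What SLA?"]}

def design(state: State) -> State:
    """Phase 3 stub: 'Carlos' and 'Ronei' produce competing designs."""
    return {**state, "designs": ["hub-and-spoke", "vwan"]}

def review(state: State) -> State:
    """Phase 5 stub: the Chief Auditor picks a design to approve."""
    return {**state, "approved": state["designs"][0]}

PHASES: list[tuple[str, Agent]] = [
    ("requirements", requirements),
    ("design", design),
    ("review", review),
]

def run(state: State) -> State:
    # Sequential orchestration: each agent reads and extends the shared state,
    # the same role LangGraph's graph plays in the real system.
    for _name, agent in PHASES:
        state = agent(state)
    return state

final = run({"request": "landing zone for 3 teams"})
print(final["approved"])
```

The shared-state-through-phases shape is what lets later agents (analysis, review) see everything earlier agents produced.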

kubectl-ai

What it is: kubectl-ai acts as an intelligent interface, translating user intent into precise Kubernetes operations, making Kubernetes management more accessible and efficient. How to install: Gemini API Key: go to https://aistudio.google.com/ and then Get API Keys. Depending on the tier, you will need to import a Google Cloud project for billing purposes. Testing: a simple test to validate the configuration. I asked kubectl-ai to list the k8s clusters I have access to. Costs: https://ai.google.dev/gemini-api/docs/pricing References: https://github.com/GoogleCloudPlatform/kubectl-ai?tab=readme-ov-file Continue reading kubectl-ai