The AI Engineering Stack

AI Engineering focuses on integrating Foundation Models—large-scale AI models pre-trained on vast datasets—into production applications. Unlike traditional Machine Learning, it shifts the focus from designing and training model architectures to orchestrating inference, evaluating outputs, and connecting models to external tools and context.

Core Disciplines
  • Prompt Engineering: Structuring instructions and context to elicit the desired model behavior.
  • RAG (Retrieval-Augmented Generation): Grounding generation in factual, external databases.
  • Evaluation: Using perplexity, exact matching, and "LLM-as-a-Judge" to score outputs.
  • Finetuning: Creating domain-specific models via Parameter-Efficient Finetuning (PEFT).

The Paradigm Shift

Traditional ML: Feature Engineering ➔ Model Selection ➔ Training loop ➔ Deployment.

AI Engineering: Prompt Design ➔ Context Engineering (RAG) ➔ System Evaluation ➔ (Optional) Finetuning/Optimization.

Generation & Sampling

Language models output a probability distribution (logits) for the next token in a sequence. Sampling techniques like Temperature, Top-K, and Top-P (Nucleus) manipulate this distribution before the final token is chosen, balancing creativity against determinism.

Key sampling hyperparameters (defaults from the interactive demo shown in parentheses):
  • Temperature (T = 1.0): T > 1 flattens the distribution (more creative); T < 1 sharpens it (more deterministic).
  • Top-K (K = 10): Limits selection to the K most probable tokens.
  • Top-P / Nucleus (P = 1.0): Keeps the smallest set of tokens whose cumulative probability is ≥ P.

Example: next-token probability distribution for the context "The cat sat on the..."
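A minimal sketch of these three controls applied in sequence, in pure Python with no real model; the logits list is illustrative:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Pick a token id from raw logits via temperature, top-k, then top-p."""
    # Temperature: divide logits before softmax. T > 1 flattens the
    # distribution, T < 1 sharpens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]
    probs.sort(key=lambda p: p[1], reverse=True)  # most probable first
    # Top-K: keep only the K most probable tokens.
    if top_k > 0:
        probs = probs[:top_k]
    # Top-P: keep the smallest prefix whose cumulative probability >= P.
    if top_p < 1.0:
        kept, cum = [], 0.0
        for tok, p in probs:
            kept.append((tok, p))
            cum += p
            if cum >= top_p:
                break
        probs = kept
    # Renormalize the surviving tokens and sample from them.
    z = sum(p for _, p in probs)
    r = random.random() * z
    for tok, p in probs:
        r -= p
        if r <= 0:
            return tok
    return probs[-1][0]
```

With `top_k=1` (or a very low temperature) this degenerates to greedy decoding, i.e., always the argmax token.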

Retrieval-Augmented Generation

RAG grounds LLMs in external knowledge by injecting relevant documents directly into the prompt context at runtime, mitigating hallucinations and knowledge-cutoff issues.

Architecture Flow
1. User Query
"What is AI Engineering?"
2. Embedding Model
Convert query to dense vector: [0.12, -0.45, ...]
3. Vector Search
Cosine similarity against document chunks in Vector DB.
4. Contextual Prompt Generation
Combine system instructions, retrieved texts, and user query.
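The four steps above can be sketched end to end. The bag-of-words embedder below is a toy stand-in for a real dense embedding model, and the chunk texts are illustrative:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real pipeline would call a
    dense embedding model here (step 2)."""
    return Counter(w.strip("?.!,") for w in text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=1):
    """Step 3: rank document chunks by similarity to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks):
    """Step 4: combine instructions, retrieved texts, and the user query."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks))
    return ("Answer using only the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {query}")

chunks = [
    "AI Engineering builds applications on top of foundation models.",
    "Quantization reduces the precision of model weights.",
]
prompt = build_prompt("What is AI Engineering?", chunks)
```

In production, `retrieve` would query a vector database over precomputed chunk embeddings rather than embedding every chunk per request.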

VRAM Math & Quantization

Inference optimization requires understanding GPU memory economics. Model weights, KV caches, and activations consume VRAM. Quantization (reducing precision from 16-bit to 8-bit or 4-bit) drastically cuts hardware requirements.

The estimate takes three inputs: model size (parameters), precision/quantization, and context size (tokens). Worked example for a 7B-parameter model at 16-bit precision:
  • Model weights: 14.00 GB (7B params × 2 bytes)
  • KV cache (est.): 1.20 GB
  • Activations & overhead: ~1.50 GB
  • Total required VRAM: 16.70 GB
  • Requires: 1x RTX 3090 / 4090 (24 GB)
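The arithmetic behind the example can be sketched as a small helper. The KV-cache and overhead figures are the rule-of-thumb constants from the example above, not derived from context length (in reality the KV cache grows with context length, layer count, and head dimensions):

```python
def vram_estimate(params_billion, bits=16, kv_cache_gb=1.2, overhead_gb=1.5):
    """Rough VRAM estimate in GB: weights = parameter count x bytes per
    parameter, plus fixed KV-cache and activation-overhead allowances."""
    weights_gb = params_billion * (bits / 8)  # 1e9 params ~= 1 GB per byte
    return weights_gb + kv_cache_gb + overhead_gb

total = vram_estimate(7, bits=16)      # 14.0 + 1.2 + 1.5 = 16.7 GB
quantized = vram_estimate(7, bits=4)   # 3.5 + 1.2 + 1.5 = 6.2 GB
```

Quantizing the same model to 4-bit drops the weight footprint by 4x, which is why a 7B model that needs a 24 GB card at FP16 can fit on a much smaller GPU at 4-bit.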

Agentic Workflows

An Agent is an LLM equipped with tools (functions) and a cyclic reasoning loop (like ReAct: Reason + Act). Instead of a single generation pass, the agent can search the web, execute code, observe the results, and refine its internal state before answering.

1. Plan / Thought
LLM decides what to do next.
2. Action (Tool Call)
Execute external function (e.g., Python, Search).
3. Observation
Return tool result back to LLM context.
Example goal for an execution trace: "Find the square root of the population of France."
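A runnable sketch of the Plan ➔ Action ➔ Observation loop. The scripted LLM and calculator tool are hypothetical stand-ins meant only to show the control flow; a real agent would call a model API and richer tools:

```python
def calculator(expr):
    """Toy tool: evaluate an arithmetic expression (no builtins exposed)."""
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def run_agent(llm, goal, max_steps=5):
    transcript = [f"Goal: {goal}"]
    for _ in range(max_steps):
        step = llm(transcript)          # 1. Plan / Thought
        if step["type"] == "final":
            return step["answer"]
        tool = TOOLS[step["tool"]]      # 2. Action (tool call)
        obs = tool(step["input"])       # 3. Observation
        transcript.append(f"Observation: {obs}")
    raise RuntimeError("step budget exhausted")

def scripted_llm(transcript):
    """Stand-in for the LLM: call the tool once, then answer."""
    if not any(line.startswith("Observation") for line in transcript):
        return {"type": "action", "tool": "calculator", "input": "2 + 2"}
    return {"type": "final", "answer": transcript[-1].split(": ")[1]}

answer = run_agent(scripted_llm, "What is 2 + 2?")  # -> "4"
```

The `max_steps` budget is the usual guard against an agent looping forever; each observation is appended to the transcript so the next planning step sees the full history.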