When we first start exploring AI and Machine Learning, the terminology can feel overwhelming. Terms like "Transformer", "backpropagation", and "RAG" get thrown around everywhere, and it's easy to feel lost without a solid foundation.
Whether we're developers building with AI, engineers diving into ML, or anyone looking to understand the tech behind tools like ChatGPT and Claude, this guide covers what we need. I've selected 30 terms that appear repeatedly in research papers, technical discussions, and real-world AI projects.
Each term includes a clear definition, why it matters, and practical examples from real-world AI systems. Let's build our AI vocabulary from the ground up.
This article is a practical, visual glossary of 30 AI/ML concepts every modern developer should understand.
Why These 30 Terms?
Whether we're building our first AI agent, integrating LLMs into an application, or trying to understand what our ML team is talking about, this glossary covers the core vocabulary we'll encounter daily. I've organized these terms from foundational concepts to modern approaches like agentic AI, with interactive visualizations to make abstract ideas concrete.
1. Foundational Concepts
Before diving into advanced topics, let's establish the fundamental building blocks of AI and machine learning.
1.1 Machine Learning
A subset of AI where systems learn patterns from data rather than being explicitly programmed. Instead of writing rules like "if email contains 'free money', mark as spam", we provide thousands of examples and let the algorithm discover the patterns itself.
Why it matters: ML is the foundation everything else builds on. When we use recommendation systems, fraud detection, or search engines, we're using ML. Understanding the core paradigm helps us choose the right approach for our problems.
There are three main types:
- Supervised Learning — Learn from labeled examples (input → output pairs)
- Unsupervised Learning — Find patterns in unlabeled data
- Reinforcement Learning — Learn through trial and error with rewards
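To make the supervised case concrete, here is a minimal sketch of the spam example above using scikit-learn. The tiny dataset and labels are made up for illustration; a real system would use thousands of examples.

```python
# Minimal supervised-learning sketch: a toy spam classifier (illustrative data only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["free money now", "meeting at 3pm", "win a free prize", "project status update"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)      # turn text into word-count features
model = MultinomialNB().fit(X, labels)    # learn the pattern from labeled examples

print(model.predict(vectorizer.transform(["claim your free money"])))  # likely [1]
```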
1.2 Deep Learning
A subset of machine learning that uses neural networks with many layers (hence "deep") to learn complex patterns. The "depth" allows the model to learn hierarchical representations — early layers detect simple features (edges, colors), while deeper layers combine these into complex concepts (faces, objects).
Deep learning powers most modern AI breakthroughs: image recognition, speech synthesis, language models, and game-playing agents.
1.3 Neural Network
A computing system inspired by the human brain, consisting of interconnected nodes (neurons) organized in layers. Each connection has a weight that adjusts during learning.
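As a rough sketch of the idea, here is a single forward pass through a two-layer network in NumPy. The weights are random, so the output is meaningless until training adjusts them.

```python
import numpy as np

# A tiny two-layer neural network: input -> hidden (ReLU) -> output.
rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                        # one input example with 4 features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # weights/biases of the hidden layer
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)    # weights/biases of the output layer

hidden = np.maximum(0, W1 @ x + b1)              # ReLU activation
output = W2 @ hidden + b2                        # prediction; training adjusts W1, b1, W2, b2
print(output)
```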
1.4 Reinforcement Learning (RL)
A learning paradigm where an agent learns by interacting with an environment, receiving rewards or penalties for its actions. Unlike supervised learning where we provide correct answers, RL agents discover optimal strategies through trial and error — much like how we learn to ride a bike by falling and adjusting.
Key components: Agent (the learner), Environment (what it interacts with), State (current situation), Action (what agent can do), Reward (feedback signal).
The RL loop: Agent observes state, takes action, receives reward, learns, and repeats.
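That loop can be sketched as tabular Q-learning. The `env` object below is a hypothetical environment, not a specific library API; it stands in for whatever the agent interacts with.

```python
import random
from collections import defaultdict

# Q-learning sketch of the RL loop; `env` is a hypothetical environment exposing
# reset() -> state and step(action) -> (next_state, reward, done).
def train(env, actions, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)] = estimated future reward
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Explore occasionally, otherwise take the best-known action.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Nudge the estimate toward reward + discounted future value.
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```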
Why it matters: RL powers some of AI's most impressive achievements. AlphaGo used RL to defeat world champions at Go, a game with more possible positions than atoms in the universe. Today, RL is central to agentic AI systems where agents must plan multi-step actions, use tools, and adapt to dynamic environments. When we build AI agents that can browse the web, write code, or manage workflows, we're applying RL principles.
RL is the training approach behind game-playing agents like AlphaGo and robotics systems that learn complex movements. More recently, RLHF (Reinforcement Learning from Human Feedback) has become critical for aligning LLMs. It's how models like GPT-4 and Claude learn to be helpful rather than harmful. The agent generates responses, humans rate them, and the model learns from that feedback signal.
2. Architecture & Model Design
Understanding model architectures helps us choose the right approach for our tasks.
2.1 Transformer
A neural network architecture introduced in the 2017 paper "Attention Is All You Need" that changed how we build language models by using self-attention to process entire sequences in parallel. Unlike RNNs that process tokens sequentially, Transformers can look at all tokens simultaneously.
Why it matters: Every major LLM we use today (GPT-4, Claude, Gemini, LLaMA) is built on the Transformer architecture. Understanding Transformers helps us grasp why models have context limits, why they're expensive to run, and how attention mechanisms work. It's the single most important architecture in modern AI.
The Transformer architecture passes input through an embedding layer and a stack of attention blocks before producing output. GPT (2018), BERT (2018), Claude, and Gemini are all Transformer-based.
2.2 Encoder/Decoder
Two complementary components used in sequence-to-sequence tasks:
- Encoder — Compresses input into a dense representation (understanding)
- Decoder — Generates output from that representation (generation)
BERT uses encoder-only (understanding), GPT uses decoder-only (generation), T5 uses both (translation, summarization).
2.3 Mixture of Experts (MoE)
An architecture where multiple specialized "expert" networks exist, but only a subset are activated for each input. A "router" network decides which experts to use. MoE enables massive models with efficient computation.
Mixtral 8x7B has 8 expert networks but only uses 2 per token. It has 47B total parameters but only 13B active per forward pass.
2.4 Tokenization
The process of breaking text into smaller units (tokens) that the model can process. Modern tokenizers use subword algorithms like BPE (Byte Pair Encoding) or SentencePiece.
Example: "unhappiness" → ["un", "happiness"]. Subword tokenization lets models understand word parts and handle rare words by combining known subwords.
Why it matters: Understanding tokenization helps explain why GPT-4 struggles with counting letters: it doesn't see individual characters, just token chunks.
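A quick way to see this in practice is OpenAI's tiktoken library; the exact splits vary by tokenizer, so treat the output as illustrative.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")    # tokenizer used by GPT-4-era models
tokens = enc.encode("unhappiness is strawberry-flavored")
print(tokens)                                 # a list of integer token IDs
print([enc.decode([t]) for t in tokens])      # subword pieces, which rarely align with letters
```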
3. Attention Mechanisms
Attention is what makes Transformers work so well.
3.1 Self-Attention
A mechanism that allows each position in a sequence to attend to all other positions, weighing their relevance. For each token, the model computes:
- Query (Q) — What am I looking for?
- Key (K) — What do I contain?
- Value (V) — What information do I have?
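Putting Q, K, and V together, here is a minimal NumPy sketch of scaled dot-product self-attention: a single head, no masking, and random projection matrices purely for illustration.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); each row is one token's embedding.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                     # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # relevance of every token to every other token
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability for softmax
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V                                   # each output mixes values by relevance

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                             # 5 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)               # (5, 16)
```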
3.2 Multi-Head Attention
Running multiple self-attention operations in parallel, each with different learned weights (different "heads"). Multiple heads let the model attend to information from different representation subspaces simultaneously.
One head might focus on syntactic relationships (subject-verb agreement), another on semantic relationships (word meanings), another on positional patterns.
4. Training & Optimization
Understanding how models learn from data.
The training loop: forward pass → compute loss → backpropagate → update weights → repeat.
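That loop maps almost line-for-line onto framework code. Here is a minimal PyTorch sketch on synthetic data; model, optimizer, and hyperparameters are illustrative choices.

```python
import torch
import torch.nn as nn

# Synthetic regression data: y = 3x + noise.
X = torch.randn(256, 1)
y = 3 * X + 0.1 * torch.randn(256, 1)

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for step in range(200):
    pred = model(X)            # forward pass
    loss = loss_fn(pred, y)    # compute loss
    optimizer.zero_grad()
    loss.backward()            # backpropagate
    optimizer.step()           # update weights

print(model.weight.item())     # should end up close to 3
```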
4.1 Gradient Descent
An optimization algorithm that iteratively adjusts model parameters to minimize the loss function. It calculates the gradient (direction of steepest increase in loss) and moves in the opposite direction.
Variants: SGD (stochastic), Adam (adaptive learning rates), AdamW (with weight decay).
4.2 Backpropagation
The algorithm for computing gradients in neural networks by propagating errors backward from output to input using the chain rule of calculus. This tells us how much each weight contributed to the error.
4.3 Loss Function
A function that measures how wrong the model's predictions are. Training aims to minimize this loss. Common losses:
- Cross-Entropy — Classification tasks
- MSE (Mean Squared Error) — Regression tasks
- Contrastive Loss — Embedding learning
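For intuition, cross-entropy for a single example reduces to the negative log of the probability the model assigned to the correct class:

```python
import math

# Cross-entropy for one example = -log(probability of the true class).
probs = {"cat": 0.7, "dog": 0.2, "fox": 0.1}   # model's predicted distribution
true_label = "cat"
loss = -math.log(probs[true_label])
print(round(loss, 3))   # 0.357: a confident correct prediction gives low loss
```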
4.4 Training vs Inference
Training: Learning from data by adjusting weights. Expensive (weeks/months, thousands of GPUs). Inference: Using the trained model to make predictions. Fast (milliseconds, single GPU).
GPT-4 was trained once at enormous cost. Every API call is inference: much faster and cheaper.
4.5 Hyperparameter
Configuration settings that control the training process but aren't learned from data:
- Learning rate — Step size in gradient descent
- Batch size — Samples processed before weight update
- Number of layers/heads — Model architecture
- Dropout rate — Regularization strength
5. Fine-Tuning & Efficiency
Techniques to adapt and optimize models efficiently.
5.1 Fine-tuning
Taking a pre-trained model and training it further on a specific dataset or task. This adapts general knowledge to specialized domains more efficiently than training from scratch.
Types: Full fine-tuning (all weights), Parameter-efficient (subset of weights), Instruction tuning (following instructions).
5.2 LoRA / QLoRA
LoRA (Low-Rank Adaptation): Instead of fine-tuning all weights, add small trainable "adapter" matrices. Original weights stay frozen, only adapters are trained.
QLoRA combines LoRA with 4-bit quantization, making it possible to fine-tune a 65B model on a single 48GB GPU.
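A rough NumPy sketch of the LoRA idea; real implementations (e.g., the peft library) wrap this around specific Transformer layers, but the math is essentially this.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                          # model dimension and the (much smaller) adapter rank

W = rng.normal(size=(d, d))            # pretrained weight: frozen, never updated
A = rng.normal(size=(r, d)) * 0.01     # trainable low-rank adapter
B = np.zeros((d, r))                   # starts at zero so the adapter is a no-op initially

def forward(x, alpha=16):
    # Effective weight is W + (alpha / r) * B @ A, applied without materializing it.
    return W @ x + (alpha / r) * (B @ (A @ x))

print(f"frozen params: {W.size:,}, trainable adapter params: {A.size + B.size:,}")
```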
5.3 Quantization
Reducing the precision of model weights (e.g., FP32 → INT8 → INT4) to decrease memory usage and speed up inference. Modern quantization methods (GPTQ, AWQ) maintain quality surprisingly well.
Why it matters: This is why models like Llama 70B can run on high-end consumer hardware. 4-bit quantization makes it possible.
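A simplified sketch of symmetric INT8 quantization; production schemes like GPTQ and AWQ are more sophisticated, but the core idea is mapping floats onto a small integer grid.

```python
import numpy as np

weights = np.random.default_rng(0).normal(size=1000).astype(np.float32)

scale = np.abs(weights).max() / 127          # map the largest weight to the int8 range
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)   # 4x smaller than FP32
dequantized = q.astype(np.float32) * scale   # approximate reconstruction at inference time

print("max error:", np.abs(weights - dequantized).max())  # small, which is why quality holds up
```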
5.4 Knowledge Distillation
Training a smaller "student" model to mimic a larger "teacher" model's behavior. The student learns from the teacher's soft probability distributions, not just hard labels, capturing more nuanced knowledge that wouldn't be available from training data alone.
Why it matters: We often need the intelligence of a 70B parameter model but can only deploy a 7B model due to cost or latency constraints. Distillation bridges this gap. The student doesn't just learn "this is a cat"; it learns "this is 90% cat, 8% dog, 2% fox," inheriting the teacher's understanding of similar concepts.
Real examples: DistilBERT retains 97% of BERT's performance with 40% fewer parameters. Microsoft's Phi models are trained partly on synthetic data generated by larger models. This technique is why we now have capable models that run on phones and laptops. The knowledge from massive models gets compressed into smaller, deployable ones.
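A sketch of the core distillation loss in PyTorch: the student is pushed toward the teacher's full soft distribution, not just the hard label. The temperature and weighting below are illustrative defaults, not fixed values.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-smoothed distribution (KL divergence).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: still learn the ground-truth labels directly.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```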
6. Model Evaluation
Understanding model performance and failure modes.
6.1 Overfitting / Underfitting
Overfitting: Model memorizes training data but fails on new data (high training accuracy, low test accuracy). Underfitting: Model is too simple to capture patterns (poor on both).
Signs of overfitting: Training loss keeps decreasing while validation loss increases.
6.2 Regularization
Techniques to prevent overfitting by adding constraints during training:
- Dropout — Randomly disable neurons during training
- L1/L2 regularization — Penalize large weights
- Early stopping — Stop when validation loss increases
- Data augmentation — Create variations of training data
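Dropout, for example, is just a random mask applied during training (inverted dropout shown here, so inference needs no rescaling):

```python
import numpy as np

def dropout(activations, rate=0.3, training=True):
    if not training:
        return activations                        # no dropout at inference time
    mask = np.random.default_rng().random(activations.shape) > rate
    return activations * mask / (1 - rate)        # rescale so expected values stay the same
```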
6.3 Hallucination
When an AI model generates plausible-sounding but factually incorrect information. LLMs are particularly prone to this because they're trained to produce fluent text, not necessarily true text.
Why it matters: This is the primary reason production AI systems need RAG, guardrails, and human oversight.
Mitigations: RAG (ground responses in documents), chain-of-thought (show reasoning), confidence calibration, human feedback.
7. Prompting & Reasoning
Techniques for getting better outputs from language models.
7.1 Prompt Engineering
The practice of crafting effective inputs to guide the LLM's behavior. Key techniques:
- Zero-shot — Ask directly without examples
- Few-shot — Provide examples in the prompt
- Role-playing — "You are an expert..."
- Structured output — Request JSON, markdown, etc.
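A few-shot prompt is just role, format instructions, and labeled examples concatenated into the input; the wording below is illustrative, not a required template.

```python
# A few-shot classification prompt combining role-playing, structured output, and examples.
prompt = """You are an expert support triage assistant.
Classify each ticket as BUG, FEATURE, or QUESTION. Reply with the label only.

Ticket: "The app crashes when I upload a PNG."
Label: BUG

Ticket: "Could you add dark mode?"
Label: FEATURE

Ticket: "How do I reset my password?"
Label:"""
# Send `prompt` to any LLM API; the examples steer the model toward the expected format.
```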
7.2 Context Engineering
The broader practice of managing all information provided to an AI system: system prompts, retrieved documents, conversation history, tool outputs, user preferences.
While prompt engineering focuses on crafting the user message, context engineering considers the full picture: What goes in the system prompt? Which documents should RAG retrieve? How much conversation history fits in the context window? Should we include tool outputs or examples?
Why it matters: For production AI systems, context engineering often matters more than the model choice. A well-engineered context with a smaller model frequently outperforms a larger model with poor context. In my experience building agents with ADK, getting context management right is often the difference between a demo and a production-ready system.
7.3 Chain-of-Thought (CoT)
A prompting technique where the model shows its reasoning step-by-step before giving a final answer. Simply adding "Let's think step by step" can boost accuracy on math and logic tasks by 10-40%.
8. Infrastructure & Retrieval
Building systems that connect LLMs to external knowledge.
8.1 Embedding
A dense vector representation of data in a continuous space where similar items are close together. Text embeddings capture semantic meaning: "king" and "queen" are close, "king" and "apple" are far.
The famous example: embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen")
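In code, "close together" usually means high cosine similarity between vectors. The tiny 3-dimensional vectors below are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Illustrative toy embeddings; a real embedding model would produce these from text.
king  = np.array([0.90, 0.80, 0.10])
queen = np.array([0.88, 0.82, 0.15])
apple = np.array([0.10, 0.20, 0.90])

print(cosine_similarity(king, queen))   # high: semantically similar
print(cosine_similarity(king, apple))   # low: unrelated
```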
8.2 Vector Database
A database optimized for storing and querying high-dimensional vectors. Uses approximate nearest neighbor (ANN) algorithms for fast similarity search across millions of embeddings.
Popular options: Pinecone, Weaviate, ChromaDB, Milvus, pgvector (PostgreSQL extension).
8.3 RAG (Retrieval-Augmented Generation)
A technique that retrieves relevant documents and includes them in the LLM's context before generating a response. This grounds responses in actual data, reduces hallucination, and enables access to private/current information.
RAG pipeline: query → embed → search vector DB → retrieve relevant docs → augment prompt → generate grounded response.
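A minimal sketch of that pipeline, with `embed()` and `llm()` as stand-ins for whatever embedding model and LLM API we actually use; a real system would pre-compute and index document embeddings in a vector database rather than embedding them per query.

```python
import numpy as np

def retrieve(query, documents, embed, top_k=3):
    # Embed the query and each document, then rank documents by cosine similarity.
    q = embed(query)
    scored = []
    for doc in documents:
        d = embed(doc)
        score = np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))
        scored.append((score, doc))
    return [doc for _, doc in sorted(scored, reverse=True)[:top_k]]

def rag_answer(query, documents, embed, llm):
    context = "\n\n".join(retrieve(query, documents, embed))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm(prompt)   # the response is now grounded in the retrieved documents
```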
9. Modern AI Systems
How AI systems are being built today.
9.1 MCP (Model Context Protocol)
An open protocol by Anthropic that standardizes how AI applications connect to external data sources and tools. Instead of building custom integrations for each service, MCP provides a universal interface.
With MCP, an AI assistant can connect to Google Drive, Slack, GitHub, databases, and more through standardized "MCP servers," enabling plug-and-play integrations.
9.2 Agentic AI
AI systems that can autonomously plan, make decisions, use tools, and take actions to accomplish goals. Goes beyond simple Q&A to multi-step reasoning and execution.
Agentic AI loop: receive goal → plan steps → execute actions using tools → observe results → iterate until goal is achieved.
An agentic AI can: research a topic → write code → test it → fix bugs → deploy, all autonomously with minimal human intervention.
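In code, the loop is roughly the sketch below; `llm_plan`, `TOOLS`, and the stopping condition are hypothetical placeholders for whatever planner and tool set an agent framework provides.

```python
def run_agent(goal, llm_plan, TOOLS, max_steps=10):
    # history accumulates everything the agent has seen and done so far.
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        # Ask the model for the next action, e.g. {"tool": ..., "input": ...} or {"done": True, "answer": ...}.
        action = llm_plan(history)
        if action.get("done"):
            return action["answer"]
        result = TOOLS[action["tool"]](action["input"])        # execute the chosen tool
        history.append(f"Used {action['tool']}: {result}")     # observe the result and iterate
    return "Stopped: step limit reached before the goal was achieved."
```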
Conclusion
These 30 terms form the vocabulary of modern AI development. Whether we're building with LLMs, training custom models, or architecting AI systems, understanding these concepts helps us communicate effectively with our teams and make better technical decisions.
Keep this glossary bookmarked as a reference. The AI field moves fast, but these fundamentals stay relevant.
Resources
- Attention Is All You Need — The original Transformer paper
- LoRA: Low-Rank Adaptation — Parameter-efficient fine-tuning
- Chain-of-Thought Prompting — Step-by-step reasoning
- RAG: Retrieval-Augmented Generation — Grounding LLMs in knowledge
- Model Context Protocol (MCP) — Open protocol for AI integrations