When we first start exploring AI and Machine Learning, the terminology can feel overwhelming. Terms like "Transformer", "backpropagation", and "RAG" get thrown around everywhere, and it's easy to feel lost without a solid foundation.
Whether we're developers building with AI, engineers diving into ML, or anyone looking to understand the tech behind tools like ChatGPT and Claude, this guide covers what we need. I've selected 30 terms that appear repeatedly in research papers, technical discussions, and real-world AI projects.
Each term includes a clear definition, why it matters, and practical examples from real-world AI systems. Let's build our AI vocabulary from the ground up.
This article is a practical, visual glossary of 30 AI/ML concepts every modern developer should understand.
Why These 30 Terms?
Whether we're building our first AI agent, integrating LLMs into an application, or trying to understand what our ML team is talking about, this glossary covers the core vocabulary we'll encounter daily. I've organized these terms from foundational concepts to modern approaches like agentic AI, with interactive visualizations to make abstract ideas concrete.
1. Foundational Concepts
Before diving into advanced topics, let's establish the fundamental building blocks of AI and machine learning.
1.1 Machine Learning
A subset of AI where systems learn patterns from data rather than being explicitly programmed. Instead of writing rules like "if email contains 'free money', mark as spam", we provide thousands of examples and let the algorithm discover the patterns itself.
Why it matters: ML is the foundation everything else builds on. When we use recommendation systems, fraud detection, or search engines, we're using ML. Understanding the core paradigm helps us choose the right approach for our problems.
There are three main types:
- Supervised Learning — Learn from labeled examples (input → output pairs)
- Unsupervised Learning — Find patterns in unlabeled data
- Reinforcement Learning — Learn through trial and error with rewards
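To make the supervised case concrete, here is a minimal sketch of the spam example above using scikit-learn. The tiny dataset and labels are made up for illustration; a real system would use thousands of examples.

```python
# Minimal supervised-learning sketch: a toy spam classifier (illustrative data only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["free money now", "meeting at 3pm", "win a free prize", "project status update"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)      # turn text into word-count features
model = MultinomialNB().fit(X, labels)    # learn the pattern from labeled examples

print(model.predict(vectorizer.transform(["claim your free money"])))  # likely [1]
```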
1.2 Deep Learning
A subset of machine learning that uses neural networks with many layers (hence "deep") to learn complex patterns. The "depth" allows the model to learn hierarchical representations — early layers detect simple features (edges, colors), while deeper layers combine these into complex concepts (faces, objects).
Deep learning powers most modern AI breakthroughs: image recognition, speech synthesis, language models, and game-playing agents.
1.3 Neural Network
A computing system inspired by the human brain, consisting of interconnected nodes (neurons) organized in layers. Each connection has a weight that adjusts during learning.
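As a rough sketch of the idea, here is a single forward pass through a two-layer network in NumPy. The weights are random, so the output is meaningless until training adjusts them.

```python
import numpy as np

# A tiny two-layer neural network: input -> hidden (ReLU) -> output.
rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                        # one input example with 4 features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # weights/biases of the hidden layer
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)    # weights/biases of the output layer

hidden = np.maximum(0, W1 @ x + b1)              # ReLU activation
output = W2 @ hidden + b2                        # prediction; training adjusts W1, b1, W2, b2
print(output)
```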
1.4 Reinforcement Learning (RL)
A learning paradigm where an agent learns by interacting with an environment, receiving rewards or penalties for its actions. Unlike supervised learning where we provide correct answers, RL agents discover optimal strategies through trial and error — much like how we learn to ride a bike by falling and adjusting.
Key components: Agent (the learner), Environment (what it interacts with), State (current situation), Action (what agent can do), Reward (feedback signal).
The RL loop: Agent observes state, takes action, receives reward, learns, and repeats.
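That loop can be sketched as tabular Q-learning. The `env` object below is a hypothetical environment, not a specific library API; it stands in for whatever the agent interacts with.

```python
import random
from collections import defaultdict

# Q-learning sketch of the RL loop; `env` is a hypothetical environment exposing
# reset() -> state and step(action) -> (next_state, reward, done).
def train(env, actions, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)] = estimated future reward
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Explore occasionally, otherwise take the best-known action.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Nudge the estimate toward reward + discounted future value.
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```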
Why it matters: RL powers some of AI's most impressive achievements. AlphaGo used RL to defeat world champions at Go, a game with more possible positions than atoms in the universe. Today, RL is central to agentic AI systems where agents must plan multi-step actions, use tools, and adapt to dynamic environments. When we build AI agents that can browse the web, write code, or manage workflows, we're applying RL principles.
RL is the training approach behind game-playing agents like AlphaGo and robotics systems that learn complex movements. More recently, RLHF (Reinforcement Learning from Human Feedback) has become critical for aligning LLMs. It's how models like GPT-4 and Claude learn to be helpful rather than harmful. The agent generates responses, humans rate them, and the model learns from that feedback signal.
2. Architecture & Model Design
Understanding model architectures helps us choose the right approach for our tasks.
2.1 Transformer
A neural network architecture introduced in the 2017 paper "Attention Is All You Need" that changed how we build language models by using self-attention to process entire sequences in parallel. Unlike RNNs that process tokens sequentially, Transformers can look at all tokens simultaneously.
Why it matters: Every major LLM we use today (GPT-4, Claude, Gemini, LLaMA) is built on the Transformer architecture. Understanding Transformers helps us grasp why models have context limits, why they're expensive to run, and how attention mechanisms work. It's the single most important architecture in modern AI.
The Transformer architecture passes input through an embedding layer and a stack of attention blocks before producing output. GPT (2018), BERT (2018), Claude, and Gemini are all Transformer-based.
2.2 Encoder/Decoder
Two complementary components used in sequence-to-sequence tasks:
- Encoder — Compresses input into a dense representation (understanding)
- Decoder — Generates output from that representation (generation)
BERT uses encoder-only (understanding), GPT uses decoder-only (generation), T5 uses both (translation, summarization).
2.3 Mixture of Experts (MoE)
An architecture where multiple specialized "expert" networks exist, but only a subset are activated for each input. A "router" network decides which experts to use. MoE enables massive models with efficient computation.
Mixtral 8x7B has 8 expert networks but only uses 2 per token. It has 47B total parameters but only 13B active per forward pass.
2.4 Tokenization
The process of breaking text into smaller units (tokens) that the model can process. Modern tokenizers use subword algorithms like BPE (Byte Pair Encoding) or SentencePiece.
Example: "unhappiness" → ["un", "happiness"]. Subword tokenization lets models understand word parts and handle rare words by combining known subwords.
Why it matters: Understanding tokenization helps explain why GPT-4 struggles with counting letters: it doesn't see individual characters, just token chunks.
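A quick way to see this in practice is OpenAI's tiktoken library; the exact splits vary by tokenizer, so treat the output as illustrative.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")    # tokenizer used by GPT-4-era models
tokens = enc.encode("unhappiness is strawberry-flavored")
print(tokens)                                 # a list of integer token IDs
print([enc.decode([t]) for t in tokens])      # subword pieces, which rarely align with letters
```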
3. Attention Mechanisms
Attention is what makes Transformers work so well.
3.1 Self-Attention
A mechanism that allows each position in a sequence to attend to all other positions, weighing their relevance. For each token, the model computes:
- Query (Q) — What am I looking for?
- Key (K) — What do I contain?
- Value (V) — What information do I have?
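Putting Q, K, and V together, here is a minimal NumPy sketch of scaled dot-product self-attention: a single head, no masking, and random projection matrices purely for illustration.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); each row is one token's embedding.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                     # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # relevance of every token to every other token
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability for softmax
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V                                   # each output mixes values by relevance

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                             # 5 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)               # (5, 16)
```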
3.2 Multi-Head Attention
Running multiple self-attention operations in parallel, each with different learned weights (different "heads"). Multiple heads let the model attend to information from different representation subspaces simultaneously.
One head might focus on syntactic relationships (subject-verb agreement), another on semantic relationships (word meanings), another on positional patterns.
4. Training & Optimization
Understanding how models learn from data.
The training loop: forward pass → compute loss → backpropagate → update weights → repeat.
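That loop maps almost line-for-line onto framework code. Here is a minimal PyTorch sketch on synthetic data; model, optimizer, and hyperparameters are illustrative choices.

```python
import torch
import torch.nn as nn

# Synthetic regression data: y = 3x + noise.
X = torch.randn(256, 1)
y = 3 * X + 0.1 * torch.randn(256, 1)

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for step in range(200):
    pred = model(X)            # forward pass
    loss = loss_fn(pred, y)    # compute loss
    optimizer.zero_grad()
    loss.backward()            # backpropagate
    optimizer.step()           # update weights

print(model.weight.item())     # should end up close to 3
```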
4.1 Gradient Descent
An optimization algorithm that iteratively adjusts model parameters to minimize the loss function. It calculates the gradient (direction of steepest increase in loss) and moves in the opposite direction.
Variants: SGD (stochastic), Adam (adaptive learning rates), AdamW (with weight decay).
4.2 Backpropagation
The algorithm for computing gradients in neural networks by propagating errors backward from output to input using the chain rule of calculus. This tells us how much each weight contributed to the error.
4.3 Loss Function
A function that measures how wrong the model's predictions are. Training aims to minimize this loss. Common losses:
- Cross-Entropy — Classification tasks
- MSE (Mean Squared Error) — Regression tasks
- Contrastive Loss — Embedding learning
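For intuition, cross-entropy for a single example reduces to the negative log of the probability the model assigned to the correct class:

```python
import math

# Cross-entropy for one example = -log(probability of the true class).
probs = {"cat": 0.7, "dog": 0.2, "fox": 0.1}   # model's predicted distribution
true_label = "cat"
loss = -math.log(probs[true_label])
print(round(loss, 3))   # 0.357: a confident correct prediction gives low loss
```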
4.4 Training vs Inference
Training: Learning from data by adjusting weights. Expensive (weeks/months, thousands of GPUs). Inference: Using the trained model to make predictions. Fast (milliseconds, single GPU).
GPT-4 was trained once at enormous cost. Every API call is inference: much faster and cheaper.
4.5 Hyperparameter
Configuration settings that control the training process but aren't learned from data:
- Learning rate — Step size in gradient descent
- Batch size — Samples processed before weight update
- Number of layers/heads — Model architecture
- Dropout rate — Regularization strength
5. Fine-Tuning & Efficiency
Techniques to adapt and optimize models efficiently.
5.1 Fine-tuning
Taking a pre-trained model and training it further on a specific dataset or task. This adapts general knowledge to specialized domains more efficiently than training from scratch.
Types: Full fine-tuning (all weights), Parameter-efficient (subset of weights), Instruction tuning (following instructions).
5.2 LoRA / QLoRA
LoRA (Low-Rank Adaptation): Instead of fine-tuning all weights, add small trainable "adapter" matrices. Original weights stay frozen, only adapters are trained.
QLoRA combines LoRA with 4-bit quantization, making it possible to fine-tune a 65B model on a single 48GB GPU.
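A rough NumPy sketch of the LoRA idea; real implementations (e.g., the peft library) wrap this around specific Transformer layers, but the math is essentially this.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                          # model dimension and the (much smaller) adapter rank

W = rng.normal(size=(d, d))            # pretrained weight: frozen, never updated
A = rng.normal(size=(r, d)) * 0.01     # trainable low-rank adapter
B = np.zeros((d, r))                   # starts at zero so the adapter is a no-op initially

def forward(x, alpha=16):
    # Effective weight is W + (alpha / r) * B @ A, applied without materializing it.
    return W @ x + (alpha / r) * (B @ (A @ x))

print(f"frozen params: {W.size:,}, trainable adapter params: {A.size + B.size:,}")
```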
5.3 Quantization
Reducing the precision of model weights (e.g., FP32 → INT8 → INT4) to decrease memory usage and speed up inference. Modern quantization methods (GPTQ, AWQ) maintain quality surprisingly well.
Why it matters: This is why models like Llama 70B can run on high-end consumer hardware. 4-bit quantization makes it possible.
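A simplified sketch of symmetric INT8 quantization; production schemes like GPTQ and AWQ are more sophisticated, but the core idea is mapping floats onto a small integer grid.

```python
import numpy as np

weights = np.random.default_rng(0).normal(size=1000).astype(np.float32)

scale = np.abs(weights).max() / 127          # map the largest weight to the int8 range
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)   # 4x smaller than FP32
dequantized = q.astype(np.float32) * scale   # approximate reconstruction at inference time

print("max error:", np.abs(weights - dequantized).max())  # small, which is why quality holds up
```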
5.4 Knowledge Distillation
Training a smaller "student" model to mimic a larger "teacher" model's behavior. The student learns from the teacher's soft probability distributions, not just hard labels, capturing more nuanced knowledge that wouldn't be available from training data alone.
Why it matters: We often need the intelligence of a 70B parameter model but can only deploy a 7B model due to cost or latency constraints. Distillation bridges this gap. The student doesn't just learn "this is a cat"; it learns "this is 90% cat, 8% dog, 2% fox," inheriting the teacher's understanding of similar concepts.
Real examples: DistilBERT retains 97% of BERT's performance with 40% fewer parameters. Microsoft's Phi models are trained partly on synthetic data generated by larger models. This technique is why we now have capable models that run on phones and laptops. The knowledge from massive models gets compressed into smaller, deployable ones.
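A sketch of the core distillation loss in PyTorch: the student is pushed toward the teacher's full soft distribution, not just the hard label. The temperature and weighting below are illustrative defaults, not fixed values.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-smoothed distribution (KL divergence).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: still learn the ground-truth labels directly.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```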
6. Model Evaluation
Understanding model performance and failure modes.
6.1 Overfitting / Underfitting
Overfitting: Model memorizes training data but fails on new data (high training accuracy, low test accuracy). Underfitting: Model is too simple to capture patterns (poor on both).
Signs of overfitting: Training loss keeps decreasing while validation loss increases.
6.2 Regularization
Techniques to prevent overfitting by adding constraints during training:
- Dropout — Randomly disable neurons during training
- L1/L2 regularization — Penalize large weights
- Early stopping — Stop when validation loss increases
- Data augmentation — Create variations of training data
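Dropout, for example, is just a random mask applied during training (inverted dropout shown here, so inference needs no rescaling):

```python
import numpy as np

def dropout(activations, rate=0.3, training=True):
    if not training:
        return activations                        # no dropout at inference time
    mask = np.random.default_rng().random(activations.shape) > rate
    return activations * mask / (1 - rate)        # rescale so expected values stay the same
```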
6.3 Hallucination
When an AI model generates plausible-sounding but factually incorrect information. LLMs are particularly prone to this because they're trained to produce fluent text, not necessarily true text.
Why it matters: This is the primary reason production AI systems need RAG, guardrails, and human oversight.
Mitigations: RAG (ground responses in documents), chain-of-thought (show reasoning), confidence calibration, human feedback.
7. Prompting & Reasoning
Techniques for getting better outputs from language models.
7.1 Prompt Engineering
The practice of crafting effective inputs to guide the LLM's behavior. Key techniques:
- Zero-shot — Ask directly without examples
- Few-shot — Provide examples in the prompt
- Role-playing — "You are an expert..."
- Structured output — Request JSON, markdown, etc.
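A few-shot prompt is just role, format instructions, and labeled examples concatenated into the input; the wording below is illustrative, not a required template.

```python
# A few-shot classification prompt combining role-playing, structured output, and examples.
prompt = """You are an expert support triage assistant.
Classify each ticket as BUG, FEATURE, or QUESTION. Reply with the label only.

Ticket: "The app crashes when I upload a PNG."
Label: BUG

Ticket: "Could you add dark mode?"
Label: FEATURE

Ticket: "How do I reset my password?"
Label:"""
# Send `prompt` to any LLM API; the examples steer the model toward the expected format.
```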
7.2 Context Engineering
The broader practice of managing all information provided to an AI system: system prompts, retrieved documents, conversation history, tool outputs, user preferences.
While prompt engineering focuses on crafting the user message, context engineering considers the full picture: What goes in the system prompt? Which documents should RAG retrieve? How much conversation history fits in the context window? Should we include tool outputs or examples?
Why it matters: For production AI systems, context engineering often matters more than the model choice. A well-engineered context with a smaller model frequently outperforms a larger model with poor context. In my experience building agents with ADK, getting context management right is often the difference between a demo and a production-ready system.
7.3 Chain-of-Thought (CoT)
A prompting technique where the model shows its reasoning step-by-step before giving a final answer. Simply adding "Let's think step by step" can boost accuracy on math and logic tasks by 10-40%.
8. Infrastructure & Retrieval
Building systems that connect LLMs to external knowledge.
8.1 Embedding
A dense vector representation of data in a continuous space where similar items are close together. Text embeddings capture semantic meaning: "king" and "queen" are close, "king" and "apple" are far.
The famous example: embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen")
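In code, "close together" usually means high cosine similarity between vectors. The tiny 3-dimensional vectors below are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Illustrative toy embeddings; a real embedding model would produce these from text.
king  = np.array([0.90, 0.80, 0.10])
queen = np.array([0.88, 0.82, 0.15])
apple = np.array([0.10, 0.20, 0.90])

print(cosine_similarity(king, queen))   # high: semantically similar
print(cosine_similarity(king, apple))   # low: unrelated
```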
8.2 Vector Database
A database optimized for storing and querying high-dimensional vectors. Uses approximate nearest neighbor (ANN) algorithms for fast similarity search across millions of embeddings.
Popular options: Pinecone, Weaviate, ChromaDB, Milvus, pgvector (PostgreSQL extension).
8.3 RAG (Retrieval-Augmented Generation)
A technique that retrieves relevant documents and includes them in the LLM's context before generating a response. This grounds responses in actual data, reduces hallucination, and enables access to private/current information.
RAG pipeline: query → embed → search vector DB → retrieve relevant docs → augment prompt → generate grounded response.
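A minimal sketch of that pipeline, with `embed()` and `llm()` as stand-ins for whatever embedding model and LLM API we actually use; a real system would pre-compute and index document embeddings in a vector database rather than embedding them per query.

```python
import numpy as np

def retrieve(query, documents, embed, top_k=3):
    # Embed the query and each document, then rank documents by cosine similarity.
    q = embed(query)
    scored = []
    for doc in documents:
        d = embed(doc)
        score = np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))
        scored.append((score, doc))
    return [doc for _, doc in sorted(scored, reverse=True)[:top_k]]

def rag_answer(query, documents, embed, llm):
    context = "\n\n".join(retrieve(query, documents, embed))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm(prompt)   # the response is now grounded in the retrieved documents
```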
9. Modern AI Systems
How AI systems are being built today.
9.1 MCP (Model Context Protocol)
An open protocol by Anthropic that standardizes how AI applications connect to external data sources and tools. Instead of building custom integrations for each service, MCP provides a universal interface.
With MCP, an AI assistant can connect to Google Drive, Slack, GitHub, databases, and more through standardized "MCP servers," enabling plug-and-play integrations.
9.2 Agentic AI
AI systems that can autonomously plan, make decisions, use tools, and take actions to accomplish goals. Goes beyond simple Q&A to multi-step reasoning and execution.
Agentic AI loop: receive goal → plan steps → execute actions using tools → observe results → iterate until goal is achieved.
An agentic AI can: research a topic → write code → test it → fix bugs → deploy, all autonomously with minimal human intervention.
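In code, the loop is roughly the sketch below; `llm_plan`, `TOOLS`, and the stopping condition are hypothetical placeholders for whatever planner and tool set an agent framework provides.

```python
def run_agent(goal, llm_plan, TOOLS, max_steps=10):
    # history accumulates everything the agent has seen and done so far.
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        # Ask the model for the next action, e.g. {"tool": ..., "input": ...} or {"done": True, "answer": ...}.
        action = llm_plan(history)
        if action.get("done"):
            return action["answer"]
        result = TOOLS[action["tool"]](action["input"])        # execute the chosen tool
        history.append(f"Used {action['tool']}: {result}")     # observe the result and iterate
    return "Stopped: step limit reached before the goal was achieved."
```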
Conclusion
These 30 terms form the vocabulary of modern AI development. Whether we're building with LLMs, training custom models, or architecting AI systems, understanding these concepts helps us communicate effectively with our teams and make better technical decisions.
Keep this glossary bookmarked as a reference. The AI field moves fast, but these fundamentals stay relevant.
Resources
- Attention Is All You Need — The original Transformer paper
- LoRA: Low-Rank Adaptation — Parameter-efficient fine-tuning
- Chain-of-Thought Prompting — Step-by-step reasoning
- RAG: Retrieval-Augmented Generation — Grounding LLMs in knowledge
- Model Context Protocol (MCP) — Open protocol for AI integrations