LLM

Large Language Model

A neural network trained on vast amounts of text data to understand and generate human language. Modern LLMs use the Transformer architecture with billions of parameters.

flowchart LR
    A[Input Text] --> B[Tokenizer]
    B --> C[Token IDs]
    C --> D[Embedding Layer]
    D --> E[Transformer Layers]
    E --> F[LM Head]
    F --> G[Token Probabilities]
    G --> H[Output Text]
    style E fill:#e3f2fd
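
The stages in the diagram map onto a simple greedy decoding loop. A minimal sketch, where `tokenizer` and `model` are hypothetical placeholders (not a specific library's API) and `model(ids)` is assumed to return next-token probabilities:

# Minimal greedy decoding sketch; `tokenizer` and `model` stand in for the
# stages in the diagram above and are not a real API.
def generate(tokenizer, model, prompt, max_new_tokens=32, eos_id=2):
    ids = tokenizer.encode(prompt)              # Input Text -> Token IDs
    for _ in range(max_new_tokens):
        probs = model(ids)                      # Embedding -> Transformer -> LM Head
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        ids.append(next_id)
        if next_id == eos_id:                   # stop at end-of-sequence token
            break
    return tokenizer.decode(ids)                # Token IDs -> Output Text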

Examples of LLMs:

  • GPT-4 (OpenAI)
  • Llama 3/4 (Meta)
  • Qwen 2.5 (Alibaba)
  • Mixtral (Mistral AI)

MoE

Mixture of Experts

An architecture where multiple specialized sub-networks (experts) are conditionally activated based on input. A router network decides which experts to use for each token, enabling efficient scaling to much larger models.

flowchart TB
    A[Input Token] --> R[Router]
    R --> |"w₁ = 0.65"| E1[Expert 1]
    R --> |"w₂ = 0.35"| E2[Expert 2]
    R -.-> |"w₃ = 0"| E3[Expert 3<br/>inactive]
    R -.-> |"w₄ = 0"| E4[Expert 4<br/>inactive]
    E1 --> S((+))
    E2 --> S
    S --> O[Output]
    style E1 fill:#c8e6c9
    style E2 fill:#c8e6c9
    style E3 fill:#f5f5f5,stroke-dasharray: 5 5
    style E4 fill:#f5f5f5,stroke-dasharray: 5 5

Key Characteristics:

  • Sparse Activation: Only K experts (typically 2) are active per token
  • Router: Learned gating network selects experts
  • Efficiency: Compute scales with active experts, not total
  • Memory: All expert weights must be accessible
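
A minimal sketch of the sparse forward pass described above, assuming the router has already produced one weight per expert (zero for inactive experts); the experts themselves are stood in for by plain Python functions:

# Sketch of sparse MoE mixing: only experts with non-zero router weight run.
def moe_forward(x, experts, router_weights):
    """experts: list of callables; router_weights: per-expert floats, mostly zero."""
    output = [0.0] * len(x)
    for expert, w in zip(experts, router_weights):
        if w == 0.0:
            continue                      # inactive expert: no compute spent
        y = expert(x)                     # run only the selected experts
        output = [o + w * yi for o, yi in zip(output, y)]
    return output

# Four toy "experts" and router weights matching the diagram (0.65 / 0.35 / 0 / 0)
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
print(moe_forward([1.0, 1.0], experts, [0.65, 0.35, 0.0, 0.0]))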

Transformer

Transformer Architecture

A neural network architecture introduced in "Attention is All You Need" (2017). It processes sequences using self-attention mechanisms instead of recurrence, enabling parallel computation and capturing long-range dependencies.

flowchart TB
    subgraph Layer["Transformer Layer"]
        A[Input] --> B[RMSNorm]
        B --> C[Self-Attention]
        C --> D((+))
        A --> D
        D --> E[RMSNorm]
        E --> F[FFN / MoE]
        F --> G((+))
        D --> G
        G --> H[Output]
    end
    style C fill:#bbdefb
    style F fill:#c8e6c9
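
The pre-norm residual structure in the diagram translates directly into code. A minimal sketch, with `rms_norm`, `self_attention`, and `ffn` left as stand-ins for the sub-modules (each described in its own entry below; real layers use separate norm weights before attention and before the FFN):

# Pre-norm transformer layer: each sub-module reads a normalized copy of the
# residual stream and its output is added back onto it.
def transformer_layer(x, rms_norm, self_attention, ffn):
    h = [a + b for a, b in zip(x, self_attention(rms_norm(x)))]  # attention + residual
    y = [a + b for a, b in zip(h, ffn(rms_norm(h)))]             # FFN/MoE + residual
    return y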

GQA

Grouped Query Attention

An optimization of Multi-Head Attention where multiple query heads share the same key-value heads. This reduces KV cache memory while maintaining most of the model quality.

flowchart TB
    subgraph MHA["Multi-Head Attention"]
        Q1[Q1] --- K1[K1]
        Q2[Q2] --- K2[K2]
        Q3[Q3] --- K3[K3]
        Q4[Q4] --- K4[K4]
    end
    subgraph GQA["Grouped Query Attention"]
        Q5[Q1] --- K5[K1]
        Q6[Q2] --- K5
        Q7[Q3] --- K6[K2]
        Q8[Q4] --- K6
    end
    style MHA fill:#ffebee
    style GQA fill:#e8f5e9

Type   Query Heads   KV Heads   KV Cache Size
MHA    32            32         100%
GQA    32            8          25%
MQA    32            1          3.125%
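
The cache-size column is simply the ratio of KV heads to query heads. A small sketch of the arithmetic, using an illustrative Llama-style layer shape (not the numbers of any specific model):

# KV cache size: 2 tensors (K and V) per layer * kv_heads * head_dim * seq_len * bytes
def kv_cache_bytes(num_layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * num_layers * kv_heads * head_dim * seq_len * bytes_per_elem

mha = kv_cache_bytes(num_layers=32, kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=32, kv_heads=8,  head_dim=128, seq_len=4096)
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB ({gqa / mha:.0%})")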

RoPE

Rotary Position Embedding

A method to encode positional information by rotating query and key vectors in complex space. Unlike absolute positional embeddings, RoPE encodes relative positions naturally through rotation, enabling better extrapolation to longer sequences.

flowchart LR
    subgraph Position["Position Encoding"]
        P[Position m] --> T["θ = base^(-2i/d)"]
        T --> CS["cos(mθ), sin(mθ)"]
    end
    subgraph Rotation["Vector Rotation"]
        X["[x₁, x₂]"] --> R["Rotation Matrix"]
        CS --> R
        R --> Y["[x₁', x₂']"]
    end
    style Position fill:#fff3e0
    style Rotation fill:#e3f2fd

Rotation Formula:

x'[2i] = x[2i] * cos(m * theta_i) - x[2i+1] * sin(m * theta_i)
x'[2i+1] = x[2i] * sin(m * theta_i) + x[2i+1] * cos(m * theta_i)
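
A minimal pure-Python sketch of the rotation above, applied to one vector at position m (the same function is applied to both query and key vectors):

import math

# Rotate consecutive pairs (x[2i], x[2i+1]) by angle m * theta_i,
# where theta_i = base^(-2i/d).
def rope(x, m, base=10000.0):
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(m * theta), math.sin(m * theta)
        out[2 * i]     = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out

print(rope([1.0, 0.0, 1.0, 0.0], m=3))  # rotate a 4-dim vector at position 3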

Q4_K

4-bit K-Quant Format

A 4-bit quantization format from llama.cpp that achieves ~4.5 bits per weight. It uses super-blocks of 256 weights with hierarchical scaling for high-quality compression.

flowchart TB
    subgraph Block["Q4_K Block (144 bytes)"]
        D["d: FP16 scale<br/>(2 bytes)"]
        DM["dmin: FP16 min<br/>(2 bytes)"]
        SC["scales: packed<br/>(12 bytes)"]
        QS["qs: 4-bit values<br/>(128 bytes)"]
    end
    subgraph Dequant["Dequantization"]
        D --> M1["d * scale"]
        DM --> M2["dmin * min"]
        QS --> V["nibble value"]
        M1 --> R["result = d*sc*q - dmin*m"]
        M2 --> R
        V --> R
    end
    style Block fill:#e1f5fe
    style Dequant fill:#fff8e1

Format Specifications:

  • Block Size: 256 elements
  • Bytes per Block: 144 bytes
  • Bits per Weight: 4.5 bpw
  • Compression Ratio: 3.5x vs FP16
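
The real Q4_K layout packs per-sub-block scales into the 12-byte scales field; the sketch below keeps only the principle (4-bit values plus a scale and a min, dequantized as scale*q - min) and is not a byte-accurate reimplementation of the llama.cpp format:

# Simplified scale/min ("K-quant" style) 4-bit quantization for one sub-block.
# The real Q4_K format packs 8 such sub-blocks into a 256-element super-block
# and stores the sub-block scales in compressed form.
def quantize_subblock(weights):
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0                    # 4 bits -> 16 levels
    q = [round((w - lo) / scale) for w in weights]   # values in 0..15
    return q, scale, -lo                             # store scale d and min dmin

def dequantize_subblock(q, scale, dmin):
    return [scale * qi - dmin for qi in q]           # result = d*q - dmin

w = [0.1, -0.3, 0.25, 0.0, 0.7, -0.6]
q, d, m = quantize_subblock(w)
print([round(x, 3) for x in dequantize_subblock(q, d, m)])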

Q6_K

6-bit K-Quant Format

A 6-bit quantization format offering higher precision than Q4_K. Each block stores 256 weights in 210 bytes, achieving ~6.5 bits per weight with excellent quality preservation.

Property      Q4_K    Q6_K
Block Size    256     256
Bytes/Block   144     210
Bits/Weight   4.5     6.5
Quality       Good    Very Good
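
The bits-per-weight figures in the table follow directly from the block sizes; a quick check, plus a rough weight-only size estimate for an illustrative (hypothetical) 70B-parameter model:

# bits per weight = bytes per block * 8 / elements per block
for name, bytes_per_block in [("Q4_K", 144), ("Q6_K", 210)]:
    print(f"{name}: {bytes_per_block * 8 / 256:.2f} bpw")

params = 70e9   # illustrative parameter count
print(f"Q6_K ~{params * (210 * 8 / 256) / 8 / 2**30:.0f} GiB vs FP16 ~{params * 2 / 2**30:.0f} GiB")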

FP16

Half-Precision Floating Point

A 16-bit floating-point format (IEEE 754-2008) commonly used in deep learning for its balance between precision and memory efficiency. Modern GPUs have dedicated FP16 tensor cores for accelerated matrix operations.

flowchart LR
    subgraph FP16["FP16 Format (16 bits)"]
        S["Sign<br/>1 bit"]
        E["Exponent<br/>5 bits"]
        M["Mantissa<br/>10 bits"]
    end
    subgraph Range["Value Range"]
        R["±6.1e-5 to ±65504"]
    end
    style FP16 fill:#e8eaf6

Characteristics:

  • Range: approximately 6.1e-5 to 65504
  • Precision: ~3.3 decimal digits
  • Memory: 2 bytes per value
  • GPU Support: Native tensor core acceleration
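
Python's struct module supports the IEEE 754 half format ('e'), which makes the rounding behaviour described above easy to observe:

import struct

# Round-trip a value through 16-bit half precision.
def to_fp16(x):
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(to_fp16(3.14159265))   # ~3.140625: only ~3.3 decimal digits survive
print(to_fp16(65504.0))      # largest finite FP16 value
print(to_fp16(1e-8))         # below the subnormal range: flushes to 0.0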

FP32

Single-Precision Floating Point

A 32-bit floating-point format providing higher precision than FP16. Used for accumulation in matrix operations to prevent precision loss during summation of many values.

  • Range: approximately 1.2e-38 to 3.4e38
  • Precision: ~7.2 decimal digits
  • Memory: 4 bytes per value
  • Use Case: Accumulation, loss computation
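
Why accumulation needs the wider format is easy to demonstrate: if every partial sum is rounded back to FP16, small contributions stop registering once the running total grows. A small sketch, using struct's half format to emulate FP16 rounding (Python floats stand in for the wide accumulator):

import struct

def to_fp16(x):  # round to the nearest representable half-precision value
    return struct.unpack('<e', struct.pack('<e', x))[0]

values = [0.001] * 100000              # 100k small contributions, true sum = 100.0

acc16 = 0.0
for v in values:
    acc16 = to_fp16(acc16 + to_fp16(v))    # round every partial sum to FP16

acc_wide = sum(values)                      # accumulate in wider precision

print(f"FP16 accumulator: {acc16}")         # stalls far below the true 100.0
print(f"Wide accumulator: {acc_wide:.6f}")  # ~100.000000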

KV Cache

Key-Value Cache

A memory optimization that stores previously computed Key and Value tensors during autoregressive generation. This avoids recomputing the Keys and Values of all previous tokens at each step, reducing the per-token attention cost from O(n^2) to O(n).

flowchart TB
    subgraph Gen1["Generation Step 1"]
        T1["Token: Hello"] --> KV1["K₁, V₁"]
    end
    subgraph Gen2["Generation Step 2"]
        T2["Token: World"] --> KV2["K₂, V₂"]
        Cache1["Cache: K₁, V₁"] --> A2[Attention]
        KV2 --> A2
    end
    subgraph Gen3["Generation Step 3"]
        T3["Token: !"] --> KV3["K₃, V₃"]
        Cache2["Cache: K₁,K₂, V₁,V₂"] --> A3[Attention]
        KV3 --> A3
    end
    Gen1 --> Gen2 --> Gen3
    style Cache1 fill:#fff3e0
    style Cache2 fill:#fff3e0
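
A minimal sketch of the bookkeeping: each step appends the new token's K and V to the cache and attends over everything stored so far. The projections are omitted and the per-step (q, k, v) vectors below are illustrative inputs, to keep the focus on the caching itself:

import math

# Toy single-head attention over a growing KV cache.
def attend(q, keys, values):
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    w = [x / total for x in w]
    return [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]

k_cache, v_cache = [], []
steps = [([1, 0], [1, 0], [0.1, 0.2]),
         ([0, 1], [0, 1], [0.3, 0.4]),
         ([1, 1], [1, 1], [0.5, 0.6])]
for step, (q, k, v) in enumerate(steps):
    k_cache.append(k)                    # cache K and V instead of recomputing them
    v_cache.append(v)
    out = attend(q, k_cache, v_cache)    # attention over all cached positions
    print(f"step {step}: {len(k_cache)} cached positions -> {out}")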

Expert Cache

MoE Expert Weight Cache

A specialized cache system for MoE models that keeps frequently used expert weights in GPU memory. Essential for large models where all experts cannot fit in VRAM simultaneously.

flowchart TB
    subgraph Storage["Weight Storage"]
        M[Model File<br/>mmap]
    end
    subgraph Cache["GPU Expert Cache"]
        direction LR
        E1["Expert 0<br/>Layer 5"]
        E2["Expert 3<br/>Layer 5"]
        E3["Expert 1<br/>Layer 6"]
        Empty["Empty Slot"]
    end
    subgraph Policy["Eviction Policy"]
        LRU["LRU: Evict oldest"]
    end
    M -->|"Load on demand"| Cache
    Cache -->|"When full"| Policy
    Policy -->|"Evict"| M
    style Cache fill:#e8f5e9
    style Policy fill:#ffebee

cuBLAS

CUDA Basic Linear Algebra Subroutines

NVIDIA's GPU-accelerated library for dense linear algebra operations. Provides highly optimized implementations of BLAS routines including GEMM (matrix multiplication).

Key Functions Used:

  • cublasGemmEx - Mixed-precision matrix multiplication
  • cublasHgemm - FP16 matrix multiplication
  • cublasSgemm - FP32 matrix multiplication

GEMM

General Matrix Multiply

The fundamental operation in neural network computation: C = alpha * A * B + beta * C. Modern GPUs are highly optimized for this operation, with tensor cores providing massive parallelism.

flowchart LR A["A
[M x K]"] --> GEMM["GEMM"] B["B
[K x N]"] --> GEMM GEMM --> C["C
[M x N]"] style GEMM fill:#e3f2fd

GEMM Formula:

C[i,j] = alpha * sum(A[i,k] * B[k,j]) + beta * C[i,j]
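
The formula translates directly into a triple loop; libraries such as cuBLAS implement the same contract with tiling and tensor cores. A naive reference sketch:

# Naive GEMM reference: C = alpha * A @ B + beta * C, with A [M x K], B [K x N].
def gemm(alpha, A, B, beta, C):
    M, K, N = len(A), len(B), len(B[0])
    for i in range(M):
        for j in range(N):
            acc = 0.0                          # accumulate in full precision
            for k in range(K):
                acc += A[i][k] * B[k][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
print(gemm(1.0, A, B, 0.0, C))   # [[19.0, 22.0], [43.0, 50.0]]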

NVIDIA GB10

NVIDIA DGX Spark / Project DIGITS

A compact AI supercomputer based on the Grace Blackwell architecture. It features a unified CPU-GPU memory architecture with 128GB of shared memory, making it well suited to running large LLMs locally.

Specifications:

GPU                Blackwell (B200 derivative)
CPU                Grace ARM64
Unified Memory     128GB LPDDR5X
Memory Bandwidth   ~273 GB/s
Tensor Cores       5th Generation
CUDA Compute       SM 100

mmap

Memory-Mapped File I/O

A system call that maps a file into the process's virtual address space. Enables direct access to file contents as if they were in memory, with the OS handling paging. Essential for loading large model files efficiently.

flowchart LR
    subgraph Disk["Disk Storage"]
        F["Model File<br/>100GB"]
    end
    subgraph VM["Virtual Memory"]
        M["Mapped Region"]
    end
    subgraph RAM["Physical RAM"]
        P1["Page 1"]
        P2["Page 2"]
        P3["..."]
    end
    F -->|"mmap()"| VM
    VM -->|"On access"| RAM
    style VM fill:#e8f5e9

RMSNorm

Root Mean Square Layer Normalization

A simplified normalization technique that only uses the RMS (root mean square) of the activations, omitting the mean subtraction of LayerNorm. It is cheaper to compute and works equally well in transformers.

Formula:

RMSNorm(x) = x / sqrt(mean(x^2) + epsilon) * gamma
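
A direct pure-Python translation of the formula:

import math

# RMSNorm: divide x by the RMS of its elements, then scale by a learned gain gamma.
def rms_norm(x, gamma, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

print(rms_norm([1.0, 2.0, 3.0, 4.0], gamma=[1.0, 1.0, 1.0, 1.0]))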

SwiGLU

Swish-Gated Linear Unit

An activation function that combines the Swish activation with a gating mechanism. Used in the FFN layers of modern transformers for improved expressiveness and training stability.

Formula:

SwiGLU(x, W_gate, W_up) = SiLU(x * W_gate) * (x * W_up)

SiLU(x) = x * sigmoid(x)
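
A per-element sketch of the formula; in a real FFN, x * W_gate and x * W_up are matrix products, reduced here to already-projected vectors `gate` and `up`:

import math

def silu(x):
    return x / (1.0 + math.exp(-x))        # SiLU(x) = x * sigmoid(x)

# Element-wise SwiGLU on pre-projected activations gate = x*W_gate, up = x*W_up.
def swiglu(gate, up):
    return [silu(g) * u for g, u in zip(gate, up)]

print(swiglu([0.5, -1.0, 2.0], [1.0, 1.0, 1.0]))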

Top-K Selection

Top-K Expert Selection

In MoE models, the process of selecting the K highest-scoring experts based on router logits. Typically K=2, meaning 2 experts are activated per token.

flowchart TB
    R["Router Scores"] --> S["Sort by score"]
    S --> T["Take top K"]
    T --> N["Normalize weights"]
    subgraph Selection["K=2 Selection"]
        E1["Expert 5: 0.65"]
        E2["Expert 2: 0.35"]
    end
    N --> Selection
    style Selection fill:#e8f5e9
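
A small sketch of the selection step itself: keep the K largest router logits, softmax over just those, and drop the rest. The logits below are illustrative and chosen so experts 5 and 2 come out at roughly the 0.65 / 0.35 weights shown in the diagram:

import math

# Convert router logits into K sparse, normalized expert weights.
def top_k_weights(logits, k=2):
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = {i: math.exp(logits[i]) for i in top}
    total = sum(exps.values())
    return {i: exps[i] / total for i in top}   # weights of selected experts sum to 1

print(top_k_weights([0.1, -0.2, 1.2, 0.3, -0.5, 1.8, 0.0, 0.4], k=2))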

Quantization Error

Weight Quantization Error

The difference between original floating-point weights and their quantized representation. Measured as RMSE (Root Mean Square Error) or perplexity degradation. Lower bit formats have higher error but smaller model size.

RMSE Formula:

RMSE = sqrt(mean((original - dequantized)^2))
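
Measuring it is a one-liner once both weight sets are available (the values below are illustrative):

import math

# RMSE between original weights and their dequantized reconstruction.
def rmse(original, dequantized):
    return math.sqrt(sum((o - d) ** 2 for o, d in zip(original, dequantized)) / len(original))

print(rmse([0.10, -0.32, 0.27], [0.094, -0.307, 0.281]))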

Paged KV Cache

Paged Attention KV Cache

A memory management technique inspired by virtual memory paging. Stores KV cache in non-contiguous blocks, enabling efficient memory sharing between sequences and reducing fragmentation.

Benefits:

  • Reduces memory fragmentation
  • Enables sharing prefixes between requests
  • Better memory utilization for variable-length sequences
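
A minimal sketch of the core idea: a per-sequence block table maps logical token positions to small fixed-size physical blocks, which can be allocated out of order. The block size and the bump allocator here are illustrative, not tied to any particular implementation:

BLOCK_SIZE = 16                      # tokens per physical KV block (illustrative)

class PagedKVCache:
    def __init__(self):
        self.next_block = 0          # trivial bump allocator for physical blocks
        self.block_tables = {}       # seq_id -> list of physical block ids

    def slot_for(self, seq_id, position):
        """Return (physical_block, offset) for a token position in a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        logical_block, offset = divmod(position, BLOCK_SIZE)
        while len(table) <= logical_block:      # allocate blocks on demand
            table.append(self.next_block)
            self.next_block += 1
        return table[logical_block], offset

cache = PagedKVCache()
print(cache.slot_for("req-A", 0))    # (0, 0)
print(cache.slot_for("req-A", 17))   # (1, 1): second block, non-contiguous is fine
print(cache.slot_for("req-B", 3))    # (2, 3): another sequence gets its own blocks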