LLM
Large Language Model
A neural network trained on vast amounts of text data to understand and generate human language. Modern LLMs use the Transformer architecture with billions of parameters.
flowchart LR
A[Input Text] --> B[Tokenizer]
B --> C[Token IDs]
C --> D[Embedding Layer]
D --> E[Transformer Layers]
E --> F[LM Head]
F --> G[Token Probabilities]
G --> H[Output Text]
style E fill:#e3f2fd
Examples of LLMs:
- GPT-4 (OpenAI)
- Llama 3/4 (Meta)
- Qwen 2.5 (Alibaba)
- Mixtral (Mistral AI)
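The pipeline in the diagram corresponds to just a few lines of inference code. A minimal sketch using the Hugging Face transformers library; the gpt2 checkpoint is only an illustrative stand-in for the models listed above:

```python
# Input Text -> Tokenizer -> Token IDs -> model -> Token Probabilities -> Output Text
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # Input Text -> Token IDs
model = AutoModelForCausalLM.from_pretrained("gpt2")       # Embedding + Transformer Layers + LM Head

inputs = tokenizer("The capital of France is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=10)   # sample tokens from the probabilities
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```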
MoE
Mixture of Experts
An architecture where multiple specialized sub-networks (experts) are conditionally activated based on the input. A router network decides which experts to use for each token, enabling efficient scaling to much larger models.
flowchart TB
A[Input Token] --> R[Router]
R --> |"w₁ = 0.65"| E1[Expert 1]
R --> |"w₂ = 0.35"| E2[Expert 2]
R -.-> |"w₃ = 0"| E3[Expert 3
inactive]
R -.-> |"w₄ = 0"| E4[Expert 4
inactive]
E1 --> S((+))
E2 --> S
S --> O[Output]
style E1 fill:#c8e6c9
style E2 fill:#c8e6c9
style E3 fill:#f5f5f5,stroke-dasharray: 5 5
style E4 fill:#f5f5f5,stroke-dasharray: 5 5
Key Characteristics:
- Sparse Activation: Only K experts (typically 2) are active per token
- Router: Learned gating network selects experts
- Efficiency: Compute scales with active experts, not total
- Memory: All expert weights must be accessible
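A minimal numpy sketch of the routing described above, assuming a single token vector, a dense router matrix, and experts represented as plain callables (all names and shapes are illustrative):

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Sparse MoE step for one token: score experts, keep the top-k,
    renormalize their weights, and mix only those experts' outputs."""
    logits = x @ router_w                          # one score per expert
    top = np.argsort(logits)[-k:]                  # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                   # softmax over the selected experts only
    return sum(wi * experts[i](x) for wi, i in zip(w, top))  # inactive experts never run

# Toy usage: 4 experts, each a random linear map over an 8-dim token.
rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda v, W=rng.standard_normal((d, d)): v @ W for _ in range(n_experts)]
y = moe_forward(rng.standard_normal(d), rng.standard_normal((d, n_experts)), experts)
```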
Transformer
Transformer Architecture
A neural network architecture introduced in "Attention is All You Need" (2017). It processes sequences using self-attention mechanisms instead of recurrence, enabling parallel computation and capturing long-range dependencies.
flowchart TB
subgraph Layer["Transformer Layer"]
A[Input] --> B[RMSNorm]
B --> C[Self-Attention]
C --> D((+))
A --> D
D --> E[RMSNorm]
E --> F[FFN / MoE]
F --> G((+))
D --> G
G --> H[Output]
end
style C fill:#bbdefb
style F fill:#c8e6c9
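The layer diagram reduces to two residual additions around normalized sublayers. A minimal sketch, with norm1, norm2, attn, and ffn passed in as stand-in callables:

```python
def transformer_layer(x, norm1, attn, norm2, ffn):
    """Pre-norm residual block as drawn above: each sublayer sees a normalized
    copy of the residual stream, and its output is added back in."""
    x = x + attn(norm1(x))   # RMSNorm -> Self-Attention -> (+)
    x = x + ffn(norm2(x))    # RMSNorm -> FFN / MoE -> (+)
    return x
```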
GQA
Grouped Query Attention
An optimization of Multi-Head Attention where multiple query heads share the same key-value heads. This reduces KV cache memory while maintaining most of the model quality.
flowchart TB
subgraph MHA["Multi-Head Attention"]
Q1[Q1] --- K1[K1]
Q2[Q2] --- K2[K2]
Q3[Q3] --- K3[K3]
Q4[Q4] --- K4[K4]
end
subgraph GQA["Grouped Query Attention"]
Q5[Q1] --- K5[K1]
Q6[Q2] --- K5
Q7[Q3] --- K6[K2]
Q8[Q4] --- K6
end
style MHA fill:#ffebee
style GQA fill:#e8f5e9
| Type | Query Heads | KV Heads | KV Cache Size |
|------|-------------|----------|---------------|
| MHA  | 32          | 32       | 100%          |
| GQA  | 32          | 8        | 25%           |
| MQA  | 32          | 1        | 3.125%        |
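A numpy sketch of the sharing pattern, using the 32-query-head / 8-KV-head configuration from the table (shapes are illustrative):

```python
import numpy as np

n_q_heads, n_kv_heads, seq, head_dim = 32, 8, 16, 64
group = n_q_heads // n_kv_heads                 # 4 query heads per KV head

q = np.random.randn(n_q_heads, seq, head_dim)
k = np.random.randn(n_kv_heads, seq, head_dim)  # only 8 K/V heads are computed and cached
v = np.random.randn(n_kv_heads, seq, head_dim)

# Each cached KV head serves `group` consecutive query heads.
k_full = np.repeat(k, group, axis=0)            # [32, seq, head_dim]
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
out = w @ v_full                                # same output shape as full MHA
```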
RoPE
Rotary Position Embedding
A method to encode positional information by rotating query and key vectors in complex space. Unlike absolute positional embeddings, RoPE encodes relative positions naturally through rotation, enabling better extrapolation to longer sequences.
flowchart LR
subgraph Position["Position Encoding"]
P[Position m] --> T["θ = base^(-2i/d)"]
T --> CS["cos(mθ), sin(mθ)"]
end
subgraph Rotation["Vector Rotation"]
X["[x₁, x₂]"] --> R["Rotation Matrix"]
CS --> R
R --> Y["[x₁', x₂']"]
end
style Position fill:#fff3e0
style Rotation fill:#e3f2fd
Rotation Formula:
x'[2i] = x[2i] * cos(m * theta_i) - x[2i+1] * sin(m * theta_i)
x'[2i+1] = x[2i] * sin(m * theta_i) + x[2i+1] * cos(m * theta_i)
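The rotation formula above, applied to a single head vector at position m (numpy sketch); the last line checks that dot products depend only on the relative position:

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Rotate each even/odd pair (x[2i], x[2i+1]) by the angle m * theta_i."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

# Relative-position property: <RoPE(q, m), RoPE(k, n)> depends only on m - n.
q, k = np.random.randn(64), np.random.randn(64)
print(np.isclose(rope(q, 5) @ rope(k, 3), rope(q, 7) @ rope(k, 5)))  # True
```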
Q4_K
4-bit K-Quant Format
A 4-bit quantization format from llama.cpp that achieves ~4.5 bits per weight. It uses super-blocks of 256 weights with hierarchical scaling for high quality compression.
flowchart TB
subgraph Block["Q4_K Block (144 bytes)"]
D["d: FP16 scale
(2 bytes)"]
DM["dmin: FP16 min
(2 bytes)"]
SC["scales: packed
(12 bytes)"]
QS["qs: 4-bit values
(128 bytes)"]
end
subgraph Dequant["Dequantization"]
D --> M1["d * scale"]
DM --> M2["dmin * min"]
QS --> V["nibble value"]
M1 --> R["result = d*sc*q - dmin*m"]
M2 --> R
V --> R
end
style Block fill:#e1f5fe
style Dequant fill:#fff8e1
Format Specifications:
- Block Size: 256 elements
- Bytes per Block: 144 bytes
- Bits per Weight: 4.5 bpw
- Compression Ratio: 3.5x vs FP16
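A simplified scale-and-minimum block quantizer in numpy, illustrating the idea behind the dequantization above; the real Q4_K format additionally packs per-sub-block scales and mins into the 12-byte scales field (and subtracts the min term), which this sketch omits:

```python
import numpy as np

def quantize_block_4bit(w):
    """Map a block of weights to 4-bit indices plus one scale and one minimum
    (a simplified stand-in for Q4_K's hierarchical per-sub-block scales)."""
    w_min = w.min()
    scale = (w.max() - w_min) / 15.0                    # 16 levels for 4 bits
    q = np.clip(np.round((w - w_min) / scale), 0, 15).astype(np.uint8)
    return scale, w_min, q

def dequantize_block_4bit(scale, w_min, q):
    return q.astype(np.float32) * scale + w_min         # reconstruct approximate weights

w = np.random.randn(256).astype(np.float32)
scale, w_min, q = quantize_block_4bit(w)
w_hat = dequantize_block_4bit(scale, w_min, q)
print("max abs error:", np.abs(w - w_hat).max())        # bounded by ~scale / 2
```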
Q6_K
6-bit K-Quant Format
A 6-bit quantization format offering higher precision than Q4_K. Each block stores 256 weights in 210 bytes, achieving ~6.5 bits per weight with excellent quality preservation.
| Property    | Q4_K | Q6_K      |
|-------------|------|-----------|
| Block Size  | 256  | 256       |
| Bytes/Block | 144  | 210       |
| Bits/Weight | 4.5  | 6.5       |
| Quality     | Good | Very Good |
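The bits-per-weight figures follow directly from the block layout (bytes per block × 8 / elements per block):

```python
print(144 * 8 / 256)   # Q4_K: 4.5 bits per weight
print(210 * 8 / 256)   # Q6_K: 6.5625 bits per weight, usually quoted as ~6.5
```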
FP16
Half-Precision Floating Point
A 16-bit floating-point format (IEEE 754-2008) commonly used in deep learning for its balance between precision and memory efficiency. Modern GPUs have dedicated FP16 tensor cores for accelerated matrix operations.
flowchart LR
subgraph FP16["FP16 Format (16 bits)"]
S["Sign
1 bit"]
E["Exponent
5 bits"]
M["Mantissa
10 bits"]
end
subgraph Range["Value Range"]
R["±65504
±6.1e-5 to ±65504"]
end
style FP16 fill:#e8eaf6
Characteristics:
- Range: approximately 6.1e-5 to 65504
- Precision: ~3.3 decimal digits
- Memory: 2 bytes per value
- GPU Support: Native tensor core acceleration
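A small numpy sketch that splits an FP16 value into the three fields shown above and reassembles it using the normal-number formula:

```python
import numpy as np

bits = int(np.array(-1.5, dtype=np.float16).view(np.uint16))  # reinterpret the 16 bits
sign     = bits >> 15              # 1 bit
exponent = (bits >> 10) & 0x1F     # 5 bits, biased by 15
mantissa = bits & 0x3FF            # 10 bits

# For normal numbers: value = (-1)^sign * 2^(exponent - 15) * (1 + mantissa / 1024)
value = (-1.0) ** sign * 2.0 ** (exponent - 15) * (1 + mantissa / 1024)
print(sign, exponent, mantissa, value)   # 1 15 512 -1.5
```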
FP32
Single-Precision Floating Point
A 32-bit floating-point format providing higher precision than FP16. Used for accumulation in matrix operations to prevent precision loss during summation of many values.
- Range: approximately 1.2e-38 to 3.4e38
- Precision: ~7.2 decimal digits
- Memory: 4 bytes per value
- Use Case: Accumulation, loss computation
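A small demonstration of why accumulators are kept in FP32: repeatedly adding a small FP16 value to a growing FP16 sum eventually stops changing it, while an FP32 accumulator stays accurate.

```python
import numpy as np

x = np.full(10_000, 0.01, dtype=np.float16)

acc16 = np.float16(0.0)
for v in x:                          # naive FP16 accumulation
    acc16 = np.float16(acc16 + v)

acc32 = np.float32(0.0)
for v in x:                          # same FP16 inputs, FP32 accumulator
    acc32 += np.float32(v)

print(acc16)   # ~32: once 0.01 < half an FP16 ulp, the running sum stops growing
print(acc32)   # ~100.02 (0.01 itself is not exactly representable in FP16)
```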
KV Cache
Key-Value Cache
A memory optimization that stores previously computed Key and Value tensors during autoregressive generation. This avoids recomputing attention for all previous tokens, reducing the per-token cost of generation from O(n^2) to O(n).
flowchart TB
subgraph Gen1["Generation Step 1"]
T1["Token: Hello"] --> KV1["K₁, V₁"]
end
subgraph Gen2["Generation Step 2"]
T2["Token: World"] --> KV2["K₂, V₂"]
Cache1["Cache: K₁, V₁"] --> A2[Attention]
KV2 --> A2
end
subgraph Gen3["Generation Step 3"]
T3["Token: !"] --> KV3["K₃, V₃"]
Cache2["Cache: K₁,K₂, V₁,V₂"] --> A3[Attention]
KV3 --> A3
end
Gen1 --> Gen2 --> Gen3
style Cache1 fill:#fff3e0
style Cache2 fill:#fff3e0
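A minimal sketch of an append-only cache for a single attention head (illustrative shapes; real implementations keep one cache per layer and head, usually preallocated):

```python
import numpy as np

class KVCache:
    """Append-only cache for one attention head: keys/values computed for
    earlier tokens are stored and reused instead of being recomputed."""
    def __init__(self, head_dim):
        self.k = np.empty((0, head_dim))
        self.v = np.empty((0, head_dim))

    def append(self, k_new, v_new):
        self.k = np.vstack([self.k, k_new])
        self.v = np.vstack([self.v, v_new])

    def attend(self, q):
        scores = self.k @ q / np.sqrt(self.k.shape[-1])   # attend over all cached tokens
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.v

# Each decode step computes K/V only for the *new* token, then appends.
cache = KVCache(head_dim=64)
for step in range(3):
    k_t, v_t, q_t = (np.random.randn(64) for _ in range(3))
    cache.append(k_t[None, :], v_t[None, :])
    out = cache.attend(q_t)
```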
Expert Cache
MoE Expert Weight Cache
A specialized cache system for MoE models that keeps frequently used expert weights in GPU memory. Essential for large models where all experts cannot fit in VRAM simultaneously.
flowchart TB
subgraph Storage["Weight Storage"]
M["Model File<br/>mmap"]
end
subgraph Cache["GPU Expert Cache"]
direction LR
E1["Expert 0
Layer 5"]
E2["Expert 3
Layer 5"]
E3["Expert 1
Layer 6"]
Empty["Empty Slot"]
end
subgraph Policy["Eviction Policy"]
LRU["LRU: Evict oldest"]
end
M -->|"Load on demand"| Cache
Cache -->|"When full"| Policy
Policy -->|"Evict"| M
style Cache fill:#e8f5e9
style Policy fill:#ffebee
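A minimal LRU sketch of the policy in the diagram; load_expert is a hypothetical callback standing in for reading the expert's weights from the mmap'd model file:

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache for MoE expert weights: keep at most `capacity` experts
    resident and evict the least recently used one when full."""
    def __init__(self, capacity, load_expert):
        self.capacity = capacity
        self.load_expert = load_expert      # hypothetical loader: (layer, expert_id) -> weights
        self.cache = OrderedDict()          # (layer, expert_id) -> weights

    def get(self, layer, expert_id):
        key = (layer, expert_id)
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as most recently used
            return self.cache[key]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used expert
        self.cache[key] = self.load_expert(layer, expert_id)
        return self.cache[key]
```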
cuBLAS
CUDA Basic Linear Algebra Subroutines
NVIDIA's GPU-accelerated library for dense linear algebra operations. Provides highly optimized implementations of BLAS routines including GEMM (matrix multiplication).
Key Functions Used:
- cublasGemmEx - Mixed-precision matrix multiplication
- cublasHgemm - FP16 matrix multiplication
- cublasSgemm - FP32 matrix multiplication
GEMM
General Matrix Multiply
The fundamental operation in neural network computation: C = alpha * A * B + beta * C. Modern GPUs are heavily optimized for this operation, with tensor cores computing small matrix tiles directly in hardware.
flowchart LR
A["A
[M x K]"] --> GEMM["GEMM"]
B["B
[K x N]"] --> GEMM
GEMM --> C["C
[M x N]"]
style GEMM fill:#e3f2fd
GEMM Formula:
C[i,j] = alpha * sum(A[i,k] * B[k,j]) + beta * C[i,j]
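The same formula in numpy (this is what a cuBLAS GEMM call computes on the GPU, shown here on the CPU with illustrative shapes):

```python
import numpy as np

# C = alpha * A @ B + beta * C with the shapes from the diagram.
M, K, N = 4, 8, 3
A = np.random.randn(M, K)
B = np.random.randn(K, N)
C = np.random.randn(M, N)
alpha, beta = 1.0, 0.5

C = alpha * (A @ B) + beta * C   # accumulate into the existing C
```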
NVIDIA GB10
NVIDIA DGX Spark / Project DIGITS
A compact AI supercomputer based on the Grace Blackwell architecture. It features a unified CPU-GPU memory architecture with 128GB of shared memory, ideal for running large LLMs locally.
Specifications:
| Component | Specification |
|-----------|---------------|
| GPU | Blackwell (B200 derivative) |
| CPU | Grace ARM64 |
| Unified Memory | 128GB LPDDR5X |
| Memory Bandwidth | ~273 GB/s |
| Tensor Cores | 5th Generation |
| CUDA Compute | SM 100 |
mmap
Memory-Mapped File I/O
A system call that maps a file into the process's virtual address space. Enables direct access to file contents as if they were in memory, with the OS handling paging. Essential for loading large model files efficiently.
flowchart LR
subgraph Disk["Disk Storage"]
F["Model File
100GB"]
end
subgraph VM["Virtual Memory"]
M["Mapped Region"]
end
subgraph RAM["Physical RAM"]
P1["Page 1"]
P2["Page 2"]
P3["..."]
end
F -->|"mmap()"| VM
VM -->|"On access"| RAM
style VM fill:#e8f5e9
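A minimal sketch using Python's built-in mmap module; "model.gguf" is only an illustrative filename:

```python
import mmap

# Map a (large) file read-only; pages are faulted in only when accessed.
with open("model.gguf", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        magic = mm[:4]          # reads only the first page from disk
        print(magic)
```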
RMSNorm
Root Mean Square Layer Normalization
A simplified normalization technique that scales activations by their RMS (root mean square), omitting the mean subtraction of LayerNorm. It is cheaper to compute and works equally well in practice for transformers.
Formula:
RMSNorm(x) = x / sqrt(mean(x^2) + epsilon) * gamma
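The formula in numpy, normalizing over the feature (last) axis:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """x / sqrt(mean(x^2) + eps) * gamma, applied over the last (feature) axis."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gamma
```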
SwiGLU
Swish-Gated Linear Unit
An activation function that combines the Swish activation with a gating mechanism. Used in the FFN layers of modern transformers for improved expressiveness and training stability.
Formula:
SwiGLU(x, W_gate, W_up) = SiLU(x * W_gate) * (x * W_up)
SiLU(x) = x * sigmoid(x)
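The same two formulas in numpy (weight shapes are illustrative; in a full FFN block the result is then passed through a down-projection):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))            # x * sigmoid(x)

def swiglu(x, w_gate, w_up):
    return silu(x @ w_gate) * (x @ w_up)     # gated elementwise product

d_model, d_ff = 8, 16                        # illustrative sizes
x = np.random.randn(d_model)
h = swiglu(x, np.random.randn(d_model, d_ff), np.random.randn(d_model, d_ff))
```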
Top-K Selection
Top-K Expert Selection
In MoE models, the process of selecting the K highest-scoring experts based on router logits. Typically K=2, meaning 2 experts are activated per token.
flowchart TB
R["Router Scores"] --> S["Sort by score"]
S --> T["Take top K"]
T --> N["Normalize weights"]
subgraph Selection["K=2 Selection"]
E1["Expert 5: 0.65"]
E2["Expert 2: 0.35"]
end
N --> Selection
style Selection fill:#e8f5e9
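A numpy sketch of the Sort → Take top K → Normalize steps shown above:

```python
import numpy as np

def top_k_experts(router_logits, k=2):
    top = np.argsort(router_logits)[-k:][::-1]        # indices of the k highest scores
    w = np.exp(router_logits[top] - router_logits[top].max())
    return top, w / w.sum()                           # weights renormalized to sum to 1

experts, weights = top_k_experts(np.array([0.1, 1.2, -0.3, 0.4, 1.8, 0.0]))
print(experts, weights)   # experts [4 1], weights ~[0.65 0.35]
```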
Quantization Error
Weight Quantization Error
The difference between original floating-point weights and their quantized representation. Measured as RMSE (Root Mean Square Error) or perplexity degradation. Lower bit formats have higher error but smaller model size.
RMSE Formula:
RMSE = sqrt(mean((original - dequantized)^2))
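The formula in numpy; as an illustrative stand-in for a full quantizer, the example measures the error of simply casting FP32 weights down to FP16 and back:

```python
import numpy as np

def rmse(original, dequantized):
    return np.sqrt(np.mean((original - dequantized) ** 2))

w = np.random.randn(4096).astype(np.float32)
print(rmse(w, w.astype(np.float16).astype(np.float32)))   # small for FP16, larger for 4-bit formats
```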
Paged KV Cache
Paged Attention KV Cache
A memory management technique inspired by virtual memory paging. Stores KV cache in non-contiguous blocks, enabling efficient memory sharing between sequences and reducing fragmentation.
Benefits:
- Reduces memory fragmentation
- Enables sharing prefixes between requests
- Better memory utilization for variable-length sequences
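A toy sketch of the block-table idea in pure Python (block sizes and IDs are illustrative, and the prefix-sharing machinery of a real paged-attention implementation is omitted):

```python
class PagedKVCache:
    """Map each sequence's logical token positions to fixed-size blocks drawn
    from a shared pool, so sequences grow without contiguous reservations."""
    def __init__(self, block_size=16):
        self.block_size = block_size
        self.blocks = {}                    # physical block id -> list of (K, V) entries
        self.block_tables = {}              # sequence id -> list of physical block ids
        self.next_block = 0

    def append(self, seq_id, kv):
        table = self.block_tables.setdefault(seq_id, [])
        if not table or len(self.blocks[table[-1]]) == self.block_size:
            self.blocks[self.next_block] = []          # allocate a fresh block on demand
            table.append(self.next_block)
            self.next_block += 1
        self.blocks[table[-1]].append(kv)

    def gather(self, seq_id):
        # Walk the block table to recover the sequence's KV entries in order.
        return [kv for b in self.block_tables.get(seq_id, []) for kv in self.blocks[b]]

cache = PagedKVCache(block_size=2)
for t in range(5):
    cache.append("req-0", ("K%d" % t, "V%d" % t))
print(cache.block_tables["req-0"])   # 5 tokens -> 3 non-contiguous blocks of size 2
```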