Mathematical Foundations

Transformer Architecture

The transformer architecture processes sequences through alternating self-attention and feed-forward layers. For an input sequence X ∈ ℝⁿˣᵈ (n tokens, hidden dimension d), each layer applies the following components:

Layer Normalization (RMSNorm)

HugeModel uses RMSNorm instead of LayerNorm for efficiency:

RMSNorm(x) = x / √(mean(x²) + ε) × γ

Where γ is the learnable scale parameter and ε = 10⁻⁶ for numerical stability.
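For reference, a minimal CPU sketch of this normalization (illustrative helper rmsNorm, not part of the HugeModel codebase):

#include <cmath>
#include <cstddef>

// Reference RMSNorm: y[i] = x[i] / sqrt(mean(x²) + eps) * gamma[i]
void rmsNorm(const float* x, const float* gamma, float* y, size_t d, float eps = 1e-6f) {
    float sumSq = 0.0f;
    for (size_t i = 0; i < d; ++i) sumSq += x[i] * x[i];
    float invRms = 1.0f / std::sqrt(sumSq / d + eps);
    for (size_t i = 0; i < d; ++i) y[i] = x[i] * invRms * gamma[i];
}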

Self-Attention

Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V

Where:

  • Q = XW_Q (Query projection)
  • K = XW_K (Key projection)
  • V = XW_V (Value projection)
  • d_k = head dimension (typically 128)

Feed-Forward Network (SwiGLU)

HugeModel uses SwiGLU activation:

FFN(x) = (SiLU(xW_gate) ⊙ xW_up) × W_down

Where SiLU(x) = x × σ(x) and ⊙ denotes element-wise multiplication.
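As a rough illustration, a scalar-loop sketch of this FFN for a single token (hypothetical weight layouts and names; the real implementation uses batched GEMMs):

#include <cmath>
#include <cstddef>
#include <vector>

// SwiGLU FFN sketch: out = (SiLU(x·W_gate) ⊙ (x·W_up)) · W_down
// Assumed layouts: W_gate, W_up are [d x dff] row-major; W_down is [dff x d] row-major.
std::vector<float> swigluFFN(const std::vector<float>& x,
                             const std::vector<float>& Wgate,
                             const std::vector<float>& Wup,
                             const std::vector<float>& Wdown,
                             size_t d, size_t dff) {
    std::vector<float> h(dff), out(d, 0.0f);
    for (size_t j = 0; j < dff; ++j) {
        float g = 0.0f, u = 0.0f;
        for (size_t i = 0; i < d; ++i) {
            g += x[i] * Wgate[i * dff + j];
            u += x[i] * Wup[i * dff + j];
        }
        float silu = g / (1.0f + std::exp(-g));   // SiLU(g) = g * sigmoid(g)
        h[j] = silu * u;                          // element-wise gate
    }
    for (size_t i = 0; i < d; ++i)
        for (size_t j = 0; j < dff; ++j)
            out[i] += h[j] * Wdown[j * d + i];
    return out;
}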

Attention Mechanism Details

Grouped Query Attention (GQA)

GQA reduces memory bandwidth by sharing K and V projections across multiple Q heads:

flowchart LR
    subgraph QHeads["Query Heads (32)"]
        Q1[Q₁]
        Q2[Q₂]
        Q3[Q₃]
        Q4[Q₄]
        Q5[...]
        Q32[Q₃₂]
    end
    subgraph KVHeads["KV Heads (4)"]
        KV1[K₁V₁]
        KV2[K₂V₂]
        KV3[K₃V₃]
        KV4[K₄V₄]
    end
    Q1 --> KV1
    Q2 --> KV1
    Q3 --> KV1
    Q4 --> KV1
    Q5 --> KV1
    Q32 --> KV4
    style QHeads fill:#e3f2fd
    style KVHeads fill:#fff3e0

GQA Implementation


// Group size: num_heads / num_kv_heads
size_t kvGroupSize = numHeads / numKVHeads;  // e.g., 32/4 = 8

for (size_t h = 0; h < numHeads; ++h) {
    size_t kvHead = h / kvGroupSize;  // Map Q head to KV head
    
    // Compute attention scores for this Q head
    // using K[kvHead] and V[kvHead]
    for (int t = 0; t < seqLen; ++t) {
        float score = 0.0f;
        for (size_t d = 0; d < headDim; ++d) {
            score += Q[h * headDim + d] * K[t * kvDim + kvHead * headDim + d];
        }
        scores[t] = score * scale;  // scale = 1/√d_k
    }
    // Softmax and weighted sum with V...
}

Rotary Position Embedding (RoPE)

RoPE encodes position information by rotating query and key vectors:

RoPE(x, pos) = R_θ,pos × x

Where for dimension pair (2i, 2i+1):

θ_i = base^(−2i/d)
x'[2i] = x[2i] × cos(pos × θ_i) − x[2i+1] × sin(pos × θ_i)
x'[2i+1] = x[2i] × sin(pos × θ_i) + x[2i+1] × cos(pos × θ_i)

Default base θ = 10000 (Qwen2.5) or 500000 (Llama 3+)

RoPE CUDA Kernel


__global__ void applyRoPEInlineKernel(
    __half* Q, __half* K,
    int seqLen, int numHeads, int numKVHeads, int headDim,
    int posOffset, float ropeTheta)
{
    // Grid: (seqLen, numHeads). Q is assumed to be laid out as
    // [seqLen, numHeads, headDim] in half precision.
    int seqIdx = blockIdx.x;
    int pos = seqIdx + posOffset;   // absolute position in the sequence
    int headIdx = blockIdx.y;
    int d = threadIdx.x * 2;        // each thread rotates one (2i, 2i+1) pair
    
    if (d >= headDim) return;
    
    float freq = 1.0f / powf(ropeTheta, (float)d / headDim);
    float angle = pos * freq;
    float cos_val = cosf(angle);
    float sin_val = sinf(angle);
    
    // Apply rotation to Q
    size_t qIdx = ((size_t)seqIdx * numHeads + headIdx) * headDim + d;
    float q0 = __half2float(Q[qIdx]);
    float q1 = __half2float(Q[qIdx + 1]);
    Q[qIdx]     = __float2half(q0 * cos_val - q1 * sin_val);
    Q[qIdx + 1] = __float2half(q0 * sin_val + q1 * cos_val);
    
    // Similarly for K (only for headIdx < numKVHeads)...
}

MoE Routing Algorithms

Router Architecture

The Mixture-of-Experts router selects which experts process each token:

flowchart TB
    subgraph Routing["Expert Routing"]
        A["Hidden State<br/>h ∈ ℝᵈ"] --> B["Gate Projection<br/>W_gate ∈ ℝᵈˣⁿ"]
        B --> C["Logits<br/>z ∈ ℝⁿ"]
        C --> D{Gating Type}
        D -->|Softmax| E["softmax(z)"]
        D -->|Sigmoid| F["σ(z)"]
        E --> G[Top-K Selection]
        F --> G
        G --> H["Expert Indices<br/>+ Weights"]
    end
    style Routing fill:#fff8e1

Gating Functions

| Gating Type | Formula | Use Case | Properties |
|---|---|---|---|
| Softmax | w_i = exp(z_i) / Σ_j exp(z_j) | Mixtral, DeepSeek | Σ w_i = 1, competitive |
| Sigmoid | w_i = σ(z_i) = 1/(1 + e^(−z_i)) | Llama 4 | Independent, 0 ≤ w_i ≤ 1 |
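A minimal sketch of the two gating functions applied to raw router logits (hypothetical free function, not HugeModel's router API):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Convert router logits z ∈ ℝⁿ into per-expert weights.
// useSoftmax = true  -> competitive weights that sum to 1 (Mixtral/DeepSeek style)
// useSoftmax = false -> independent sigmoid weights in [0, 1] (Llama 4 style)
std::vector<float> gateWeightsFromLogits(const std::vector<float>& z, bool useSoftmax) {
    std::vector<float> w(z.size());
    if (useSoftmax) {
        float maxZ = z[0];
        for (float v : z) maxZ = std::max(maxZ, v);   // subtract max for stability
        float sum = 0.0f;
        for (size_t i = 0; i < z.size(); ++i) { w[i] = std::exp(z[i] - maxZ); sum += w[i]; }
        for (float& v : w) v /= sum;
    } else {
        for (size_t i = 0; i < z.size(); ++i) w[i] = 1.0f / (1.0f + std::exp(-z[i]));
    }
    return w;
}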

Expert Computation

For Top-K selected experts:

output = Σ_{i ∈ TopK} w_i × Expert_i(x)

Each expert is a SwiGLU FFN:

Expert_i(x) = (SiLU(xW_gate,i) ⊙ xW_up,i) × W_down,i

MoE Forward Pass


void runMoELayer(const float* hidden, float* output, int layerIdx) {
    // 1. Router: compute expert logits
    std::vector<float> gateLogits(numExperts);
    gemm(hidden, gateWeights[layerIdx], gateLogits.data(), 1, numExperts, hiddenSize);
    
    // 2. Select Top-K experts
    auto [topK_indices, topK_weights] = selectTopK(gateLogits.data(), K);
    
    // 3. Renormalize the selected weights so they sum to 1 (softmax gating)
    if (gatingType == SOFTMAX) {
        float sum = 0;
        for (int i = 0; i < K; i++) sum += topK_weights[i];
        for (int i = 0; i < K; i++) topK_weights[i] /= sum;
    }
    
    // 4. Compute expert outputs
    memset(output, 0, hiddenSize * sizeof(float));
    std::vector<float> expertOut(hiddenSize);
    for (int i = 0; i < K; i++) {
        int expertId = topK_indices[i];
        float weight = topK_weights[i];
        
        // Load expert from cache or mmap
        Expert* expert = expertCache->getOrLoad(layerIdx, expertId);
        
        // FFN computation
        expert->forward(hidden, expertOut.data());
        
        // Weighted accumulation
        for (int j = 0; j < hiddenSize; j++) {
            output[j] += weight * expertOut[j];
        }
    }
}
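The selectTopK helper above is not shown; the following is a standalone sketch of what such a routine could look like (hypothetical signature that takes numExperts explicitly, returning the K most probable experts and their softmax probabilities):

#include <algorithm>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Softmax over all router logits, then return the K most probable experts and
// their (un-renormalized) probabilities; the caller renormalizes over the subset.
std::pair<std::vector<int>, std::vector<float>>
selectTopK(const float* logits, int numExperts, int K) {
    // Stable softmax over the full logit vector
    float maxL = *std::max_element(logits, logits + numExperts);
    std::vector<float> probs(numExperts);
    float sum = 0.0f;
    for (int i = 0; i < numExperts; ++i) { probs[i] = std::exp(logits[i] - maxL); sum += probs[i]; }
    for (float& p : probs) p /= sum;
    
    // Indices of the K largest probabilities
    std::vector<int> idx(numExperts);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + K, idx.end(),
                      [&](int a, int b) { return probs[a] > probs[b]; });
    idx.resize(K);
    
    std::vector<float> w(K);
    for (int i = 0; i < K; ++i) w[i] = probs[idx[i]];
    return {idx, w};
}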

Quantization Mathematics

Q4_K Dequantization

The Q4_K format stores 256 weights in 144 bytes:

Block Layout (144 bytes)

| Offset | Field | Description |
|---|---|---|
| 0-1 | d | FP16 super-block scale |
| 2-3 | dmin | FP16 super-block min |
| 4-15 | scales[12] | 6-bit scales + 4-bit mins (packed) |
| 16-143 | qs[128] | 4-bit quantized values (2 per byte) |

Q4_K Dequantization Algorithm


void dequantQ4KRow(const uint8_t* src, float* dst, size_t numElements) {
    constexpr size_t QK_K = 256;
    constexpr size_t blockSize = 144;
    
    for (size_t ib = 0; ib < numElements / QK_K; ++ib) {
        const uint8_t* block = src + ib * blockSize;
        
        // Extract FP16 scales
        float d = fp16_to_fp32(*(uint16_t*)(block + 0));
        float dmin = fp16_to_fp32(*(uint16_t*)(block + 2));
        
        const uint8_t* scales = block + 4;
        const uint8_t* qs = block + 16;
        
        float* y = dst + ib * QK_K;
        
        // Process 4 groups of 64 elements
        int scaleIdx = 0;
        for (int j = 0; j < QK_K; j += 64) {
            // Extract 6-bit scale and min for each group
            uint8_t sc, m;
            getScaleMinK4(scaleIdx, scales, &sc, &m);
            float d1 = d * sc, m1 = dmin * m;
            
            getScaleMinK4(scaleIdx + 1, scales, &sc, &m);
            float d2 = d * sc, m2 = dmin * m;
            
            // Low nibbles -> first 32 elements
            for (int l = 0; l < 32; ++l) {
                *y++ = d1 * (qs[l] & 0xF) - m1;
            }
            // High nibbles -> next 32 elements
            for (int l = 0; l < 32; ++l) {
                *y++ = d2 * (qs[l] >> 4) - m2;
            }
            
            qs += 32;
            scaleIdx += 2;
        }
    }
}
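getScaleMinK4 unpacks the 6-bit scales and mins from the 12-byte scales field. The sketch below follows the packing convention used by llama.cpp's get_scale_min_k4 and is included here for completeness; it is an assumption that HugeModel uses the same layout:

#include <cstdint>

// Unpack the 6-bit scale and 6-bit min for sub-block j (0..7) from the packed
// 12-byte scales field of a Q4_K block (llama.cpp packing convention).
static inline void getScaleMinK4(int j, const uint8_t* q, uint8_t* sc, uint8_t* m) {
    if (j < 4) {
        *sc = q[j] & 63;
        *m  = q[j + 4] & 63;
    } else {
        *sc = (q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4);
        *m  = (q[j + 4] >> 4)   | ((q[j] >> 6) << 4);
    }
}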

Quantization Error Analysis

Expected quantization error for different formats:

| Format | Bits/Weight | RMSE | Perplexity Impact |
|---|---|---|---|
| FP16 | 16 | ~0 | Baseline |
| Q8_0 | 8.5 | ~0.001 | +0.01 |
| Q6_K | 6.5 | ~0.003 | +0.05 |
| Q4_K | 4.5 | ~0.008 | +0.15 |

Memory Optimization Techniques

Unified Memory on GB10

The NVIDIA GB10 provides 128GB of unified CPU/GPU memory:

flowchart TB
    subgraph Physical["Physical Memory (128GB)"]
        A[LPDDR5X Banks]
    end
    subgraph Virtual["Virtual Address Space"]
        B[CPU Mapping]
        C[GPU Mapping]
    end
    subgraph Access["Access Patterns"]
        D[cudaMallocManaged]
        E[Direct Host Access]
        F[GPU Kernel Access]
    end
    Physical --> B
    Physical --> C
    D --> B
    D --> C
    E --> B
    F --> C
    style Physical fill:#c8e6c9
    style Virtual fill:#bbdefb
    style Access fill:#fff3e0
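A minimal sketch of allocating and prefetching a buffer in unified memory through the standard CUDA runtime API (illustrative only; HugeModel's own allocator is not shown here):

#include <cuda_runtime.h>
#include <cstdio>

// Allocate a unified-memory buffer that both CPU and GPU can address directly.
float* allocUnified(size_t numFloats) {
    float* ptr = nullptr;
    cudaError_t err = cudaMallocManaged(&ptr, numFloats * sizeof(float));
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return nullptr;
    }
    // Optionally migrate the pages to the GPU ahead of the first kernel launch.
    int device = 0;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(ptr, numFloats * sizeof(float), device, /*stream=*/0);
    return ptr;
}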

Memory-Mapped I/O Strategy


#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Zero-copy model loading with mmap
void* loadModelMmap(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    
    struct stat sb;
    fstat(fd, &sb);
    
    void* mapped = mmap(nullptr, sb.st_size,
                        PROT_READ,
                        MAP_PRIVATE | MAP_POPULATE,  // Prefetch pages at map time
                        fd, 0);
    if (mapped == MAP_FAILED) {
        close(fd);
        return nullptr;
    }
    
    // Advise kernel about access pattern
    madvise(mapped, sb.st_size, MADV_SEQUENTIAL);
    
    return mapped;
}

// Direct GPU access to mmap'd region (via unified memory)
__global__ void dequantKernel(const uint8_t* mmapWeights, __half* gpuOutput) {
    // mmapWeights can be accessed directly from GPU on GB10
    // due to unified memory architecture
}

KV Cache Memory Management

The Paged KV Cache optimizes memory for variable-length sequences:

Memory Calculation

KV_Memory = 2 × num_layers × seq_len × kv_dim × sizeof(FP16)

For Qwen2.5-3B with seq_len=2048:

= 2 × 36 × 2048 × 256 × 2 bytes = 75.5 MB
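The same calculation as a small helper, with the Qwen2.5-3B numbers above (36 layers, kv_dim = 256, FP16) passed in as example values:

#include <cstddef>
#include <cstdio>

// KV cache bytes = 2 (K and V) * layers * seq_len * kv_dim * bytes per element
size_t kvCacheBytes(size_t numLayers, size_t seqLen, size_t kvDim, size_t elemBytes = 2) {
    return 2 * numLayers * seqLen * kvDim * elemBytes;
}

int main() {
    size_t bytes = kvCacheBytes(36, 2048, 256);        // Qwen2.5-3B example
    std::printf("KV cache: %.1f MB\n", bytes / 1e6);   // prints ≈ 75.5 MB
    return 0;
}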

Numerical Stability

FP32 Accumulation

HugeModel uses FP16 storage with FP32 accumulation to prevent precision loss:

cuBLAS Mixed Precision


void gemmFP16_FP32Accum(cublasHandle_t cublas,
                        cublasOperation_t opA, cublasOperation_t opB,
                        int M, int N, int K,
                        float alpha,
                        const __half* A, int lda,
                        const __half* B, int ldb,
                        float beta,
                        __half* C, int ldc)
{
    cublasGemmEx(
        cublas,
        opA, opB,
        M, N, K,
        &alpha,                    // FP32 scaling
        A, CUDA_R_16F, lda,       // FP16 input A
        B, CUDA_R_16F, ldb,       // FP16 input B
        &beta,
        C, CUDA_R_16F, ldc,       // FP16 output C
        CUBLAS_COMPUTE_32F,       // FP32 internal accumulation
        CUBLAS_GEMM_DEFAULT
    );
}

Softmax Numerical Stability

Naive softmax can overflow. We use the stable version:

softmax(x)_i = exp(x_i − max(x)) / Σ_j exp(x_j − max(x))

__global__ void stableSoftmax(float* x, int n) {
    __shared__ float sMax, sSum;
    
    // 1. Find max (parallel reduction)
    float localMax = -FLT_MAX;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        localMax = fmaxf(localMax, x[i]);
    }
    // ... reduction to get global max ...
    
    // 2. Compute exp(x - max) and sum
    float localSum = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        x[i] = expf(x[i] - sMax);
        localSum += x[i];
    }
    // ... reduction to get global sum ...
    
    // 3. Normalize
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        x[i] /= sSum;
    }
}

Performance Analysis

Computational Complexity

| Operation | FLOPs per Token | Memory Bandwidth |
|---|---|---|
| QKV Projection | 6 × d × d | 3 × d² × 2B |
| Attention (per head) | 4 × seq × d_head | 2 × seq × d_head × 2B |
| FFN (SwiGLU) | 8 × d × 4d | 3 × 4d² × 2B |
| MoE Expert (K active) | K × 8 × d × intermediate | K × 3 × intermediate × d × 2B |
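For a back-of-envelope check, the dense-layer terms from this table can be summed in a few lines (hypothetical helper; the parameter values in main are illustrative only):

#include <cstdio>

// Rough per-token FLOP count for one dense transformer layer, using the
// table's terms: QKV projection + attention over the context + SwiGLU FFN.
double denseLayerFlopsPerToken(double d, double dHead, double numHeads, double seqLen) {
    double qkv  = 6.0 * d * d;                        // Q, K, V projections
    double attn = 4.0 * seqLen * dHead * numHeads;    // scores + weighted sum, all heads
    double ffn  = 8.0 * d * 4.0 * d;                  // SwiGLU with d_ff = 4d
    return qkv + attn + ffn;
}

int main() {
    // Example: d = 2048, 16 heads of dimension 128, 2048-token context
    double flops = denseLayerFlopsPerToken(2048, 128, 16, 2048);
    std::printf("~%.1f MFLOPs per token per layer\n", flops / 1e6);
    return 0;
}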

Bottleneck Analysis

pie title Time Distribution (Qwen2.5-3B, Single Token)
    "Dequantization" : 45
    "QKV GEMM" : 20
    "Attention" : 15
    "FFN GEMM" : 15
    "Other" : 5

Optimization Opportunities

1. GPU Dequantization

Current: CPU dequant + PCIe transfer

Improvement: Direct GPU kernel dequantization

Expected speedup: 5-10x for memory-bound operations

2. Async Expert Prefetch

Current: Synchronous expert loading

Improvement: Predict and prefetch next experts

Expected speedup: Hide 80% of load latency

3. Flash Attention

Current: Standard attention with full materialization

Improvement: Tiled, fused attention kernel

Expected speedup: 2-3x for long sequences