HugeModel Inference Engine
Run Large Language Models Beyond GPU Memory Limits
A high-performance C++/CUDA inference engine designed for Mixture-of-Experts models on NVIDIA GB10 with unified memory architecture.
Project Overview
Problem Statement
Modern LLMs such as Llama 4 (400B+ total parameters) far exceed the memory of a typical single GPU (24-80GB). Traditional solutions require expensive multi-GPU setups or aggressive model compression.
Our Solution
HugeModel leverages unified memory on NVIDIA GB10 (128GB shared CPU/GPU memory) combined with MoE architecture to efficiently run models that would otherwise be impossible on single-GPU systems.
Key Innovation
Memory-mapped weights with on-demand dequantization, intelligent expert caching, and optimized cuBLAS operations enable efficient inference with minimal memory footprint.
Key Features
Memory Efficiency
- Memory-mapped model weights (zero-copy loading; sketch after this list)
- Q4_K and Q6_K quantization support
- Dynamic expert loading with LRU caching
- Paged KV Cache for long sequences
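The memory-mapped loading path might look roughly like the following sketch; the `MappedFile` helper, the file layout, and the `madvise` hint are illustrative assumptions, not the engine's actual API.

```cpp
// Sketch: zero-copy weight access via POSIX mmap.
// The MappedFile helper and tensor-offset scheme below are illustrative only.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <stdexcept>

struct MappedFile {
    void*  data = nullptr;
    size_t size = 0;

    explicit MappedFile(const char* path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) throw std::runtime_error("open failed");
        struct stat st{};
        fstat(fd, &st);
        size = static_cast<size_t>(st.st_size);
        // Read-only private mapping: pages are faulted in on demand, so
        // "loading" a 100GB+ model copies nothing into RAM up front.
        data = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  // the mapping stays valid after close
        if (data == MAP_FAILED) throw std::runtime_error("mmap failed");
        // Hint that access is sparse/random (experts are touched selectively).
        madvise(data, size, MADV_RANDOM);
    }
    ~MappedFile() { if (data && data != MAP_FAILED) munmap(data, size); }

    // Quantized tensor blocks are referenced in place; bytes are only read
    // when the dequantization step actually touches them.
    const void* tensor(size_t offset) const {
        return static_cast<const char*>(data) + offset;
    }
};
```

Because pages are faulted in lazily, only the tensors a forward pass actually touches ever occupy physical memory.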
Performance
- FP16 computation with FP32 accumulation
- Optimized GEMM via cuBLAS (sketch after this list)
- RoPE positional encoding
- Grouped Query Attention
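A hedged sketch of how the FP16-compute / FP32-accumulate GEMM path could be expressed with `cublasGemmEx`; the wrapper function, dimensions, and layouts are placeholders rather than the project's real interface.

```cpp
// Sketch: FP16 operands with FP32 accumulation through cublasGemmEx.
// Matrix sizes, pointers, and the wrapper itself are placeholders.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void gemm_fp16_fp32_acc(cublasHandle_t handle,
                        const __half* A, const __half* B, __half* C,
                        int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    // CUBLAS_COMPUTE_32F: products are accumulated in FP32 even though
    // A, B, and C are stored as FP16, which preserves accuracy across the
    // long reduction dimensions typical of transformer hidden sizes.
    cublasStatus_t status = cublasGemmEx(
        handle,
        CUBLAS_OP_N, CUBLAS_OP_N,
        m, n, k,
        &alpha,
        A, CUDA_R_16F, m,
        B, CUDA_R_16F, k,
        &beta,
        C, CUDA_R_16F, m,
        CUBLAS_COMPUTE_32F,
        CUBLAS_GEMM_DEFAULT);
    (void)status;  // error handling omitted in this sketch
}
```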
Target Hardware
- NVIDIA GB10 (Blackwell architecture)
- 128GB unified CPU/GPU memory (sketch after this list)
- SM 100 compute capability
- PCIe Gen5 / NVLink support
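On unified-memory hardware like the GB10, weight buffers can be allocated once and touched from either the CPU or the GPU. The sketch below uses `cudaMallocManaged` and `cudaMemPrefetchAsync` to illustrate the idea; the device index and allocation size are arbitrary examples.

```cpp
// Sketch: managed allocations let a single GB10-class device address far
// more than a discrete GPU's VRAM. Sizes and the device index are examples.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int device = 0;
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device);
    std::printf("GPU: %s, compute %d.%d, managed memory: %d\n",
                prop.name, prop.major, prop.minor, prop.managedMemory);

    // One allocation visible to both the CPU (dequant fallback path) and GPU.
    const size_t bytes = 8ull << 30;  // e.g. 8GB of expert weights
    void* weights = nullptr;
    if (cudaMallocManaged(&weights, bytes) != cudaSuccess) return 1;

    // Optionally migrate hot pages toward the GPU ahead of the next layer.
    cudaMemPrefetchAsync(weights, bytes, device, /*stream=*/0);
    cudaDeviceSynchronize();

    cudaFree(weights);
    return 0;
}
```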
System Architecture Overview
```mermaid
flowchart TB
    subgraph Input["Input Processing"]
        A[Text Prompt] --> B[Tokenizer]
        B --> C[Token IDs]
    end
    subgraph Memory["Memory Management"]
        D[(Model Weights<br/>mmap)]
        E[(Expert Cache<br/>LRU)]
        F[(KV Cache<br/>Paged)]
    end
    subgraph Inference["Inference Pipeline"]
        C --> G[Embedding Lookup]
        G --> H[Transformer Layers]
        H --> I[MoE Routing]
        I --> J[Expert Computation]
        J --> K[Attention]
        K --> L[Output Projection]
        L --> M[LM Head]
        M --> N[Sampling]
    end
    D --> G
    D --> H
    E --> J
    F --> K
    N --> O[Generated Token]
    O -->|Loop| C
    style Input fill:#e1f5fe
    style Memory fill:#fff3e0
    style Inference fill:#e8f5e9
```
Supported Models
| Model | Parameters | Architecture | Quantization | Status |
|---|---|---|---|---|
| Qwen2.5-3B | 3B | Dense Transformer | Q4_K_M | Tested |
| Llama 4 Scout | 109B total (17B active) | MoE (16 experts) | Q4_K / Q6_K | In Progress |
| Mixtral 8x7B | 47B | MoE (8 experts) | Q4_K | Planned |
Inference Workflow
```mermaid
sequenceDiagram
participant U as User
participant T as Tokenizer
participant E as Engine
participant M as Memory Manager
participant G as GPU
U->>T: Input Prompt
T->>E: Token IDs
loop For each layer
E->>M: Request Weights
M->>M: Check mmap cache
M->>G: Dequantize Q4K/Q6K
G->>G: RMSNorm
G->>G: QKV Projection
G->>G: RoPE Encoding
G->>G: Attention (GQA)
alt MoE Layer
G->>G: Router (Top-K)
G->>M: Load Experts
M->>G: Expert FFN
else Dense Layer
G->>G: FFN (Gate+Up+Down)
end
G->>G: Residual Add
end
G->>G: LM Head
G->>G: Sampling
E->>T: Token ID
T->>U: Generated Text
```
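The MoE routing and expert-loading steps in the diagram could be implemented along these lines; `ExpertCache`, `Expert`, the `load` callback, and the top-k selection shown here are illustrative types, not the engine's actual interface.

```cpp
// Sketch: top-k expert selection followed by an LRU expert-cache lookup.
// ExpertCache, Expert, and the routing scores are illustrative.
#include <algorithm>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

struct Expert { /* dequantized FFN weights resident on the GPU */ };

class ExpertCache {
    size_t capacity_;
    std::list<int> lru_;  // most recently used expert id at the front
    std::unordered_map<int, std::pair<Expert*, std::list<int>::iterator>> map_;
public:
    explicit ExpertCache(size_t capacity) : capacity_(capacity) {}

    // Returns the cached expert, or loads + dequantizes it on a miss,
    // evicting the least recently used expert when the cache is full.
    Expert* get(int expert_id, Expert* (*load)(int)) {
        auto it = map_.find(expert_id);
        if (it != map_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second.second);  // mark as recent
            return it->second.first;
        }
        if (map_.size() == capacity_) {
            int victim = lru_.back();
            lru_.pop_back();
            delete map_[victim].first;  // free the evicted expert's buffers
            map_.erase(victim);
        }
        Expert* e = load(expert_id);    // mmap read + dequantize on demand
        lru_.push_front(expert_id);
        map_[expert_id] = {e, lru_.begin()};
        return e;
    }
};

// Top-k routing: pick the k highest router logits for the current token.
std::vector<int> top_k_experts(const std::vector<float>& logits, int k) {
    std::vector<int> idx(logits.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = static_cast<int>(i);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });
    idx.resize(k);
    return idx;
}
```

With per-layer top-k routing, only the selected experts' FFN weights need to be resident at once, which is what makes LRU caching effective for large MoE models.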
Performance Metrics
| Metric | Value | Notes |
|---|---|---|
| Throughput (Qwen2.5-3B) | ~0.1 tok/s | CPU dequant path |
| Unified memory | 128GB | GB10 target |
| Q4_K quantization | 4.5 bits/weight | 144 bytes per 256-element super-block |
| Max experts | 256 | With LRU caching |
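The 4.5 bits/weight and 144 bytes per 256-element figures correspond to the GGML-style Q4_K super-block layout. The struct below is a sketch of that layout following the common llama.cpp convention; the engine's own definition may differ.

```cpp
// Sketch of a GGML-style Q4_K super-block: 256 weights stored in 144 bytes
// (144 * 8 / 256 = 4.5 bits per weight). Field names follow the common
// llama.cpp convention; the engine's actual definition may differ.
#include <cstdint>

typedef uint16_t half;   // IEEE fp16 stored as raw bits

struct BlockQ4K {
    half    d;           // super-block scale for the quantized scales
    half    dmin;        // super-block scale for the quantized mins
    uint8_t scales[12];  // 8 sub-blocks x (6-bit scale + 6-bit min), packed
    uint8_t qs[128];     // 256 x 4-bit quantized weights, two per byte
};
static_assert(sizeof(BlockQ4K) == 144, "Q4_K super-block should be 144 bytes");
```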