HugeModel Inference Engine
Run Large Language Models Beyond GPU Memory Limits
A high-performance C++/CUDA inference engine designed for Mixture-of-Experts models on NVIDIA GB10 with unified memory architecture.
Project Overview
Problem Statement
Modern LLMs such as Llama 4 (400B+ total parameters) far exceed the memory of a typical single GPU (24-80GB). Traditional solutions require expensive multi-GPU setups or aggressive model compression.
Our Solution
HugeModel leverages unified memory on NVIDIA GB10 (128GB shared CPU/GPU memory) combined with MoE architecture to efficiently run models that would otherwise be impossible on single-GPU systems.
Key Innovation
Memory-mapped weights with on-demand dequantization, intelligent expert caching, and optimized cuBLAS operations enable efficient inference with minimal memory footprint.
Key Features
Memory Efficiency
- Memory-mapped model weights (zero-copy loading; sketch after this list)
- Q4_K and Q6_K quantization support
- Dynamic expert loading with LRU caching
- Paged KV Cache for long sequences
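The memory-mapped loading path might look roughly like the following sketch; the `MappedFile` helper, the file layout, and the `madvise` hint are illustrative assumptions, not the engine's actual API.

```cpp
// Sketch: zero-copy weight access via POSIX mmap.
// The MappedFile helper and tensor-offset scheme below are illustrative only.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <stdexcept>

struct MappedFile {
    void*  data = nullptr;
    size_t size = 0;

    explicit MappedFile(const char* path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) throw std::runtime_error("open failed");
        struct stat st{};
        fstat(fd, &st);
        size = static_cast<size_t>(st.st_size);
        // Read-only private mapping: pages are faulted in on demand, so
        // "loading" a 100GB+ model copies nothing into RAM up front.
        data = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  // the mapping stays valid after close
        if (data == MAP_FAILED) throw std::runtime_error("mmap failed");
        // Hint that access is sparse/random (experts are touched selectively).
        madvise(data, size, MADV_RANDOM);
    }
    ~MappedFile() { if (data && data != MAP_FAILED) munmap(data, size); }

    // Quantized tensor blocks are referenced in place; bytes are only read
    // when the dequantization step actually touches them.
    const void* tensor(size_t offset) const {
        return static_cast<const char*>(data) + offset;
    }
};
```

Because pages are faulted in lazily, only the tensors a forward pass actually touches ever occupy physical memory.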
Performance
- FP16 computation with FP32 accumulation
- Optimized GEMM via cuBLAS (sketch after this list)
- RoPE positional encoding
- Grouped Query Attention
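A hedged sketch of how the FP16-compute / FP32-accumulate GEMM path could be expressed with `cublasGemmEx`; the wrapper function, dimensions, and layouts are placeholders rather than the project's real interface.

```cpp
// Sketch: FP16 operands with FP32 accumulation through cublasGemmEx.
// Matrix sizes, pointers, and the wrapper itself are placeholders.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void gemm_fp16_fp32_acc(cublasHandle_t handle,
                        const __half* A, const __half* B, __half* C,
                        int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    // CUBLAS_COMPUTE_32F: products are accumulated in FP32 even though
    // A, B, and C are stored as FP16, which preserves accuracy across the
    // long reduction dimensions typical of transformer hidden sizes.
    cublasStatus_t status = cublasGemmEx(
        handle,
        CUBLAS_OP_N, CUBLAS_OP_N,
        m, n, k,
        &alpha,
        A, CUDA_R_16F, m,
        B, CUDA_R_16F, k,
        &beta,
        C, CUDA_R_16F, m,
        CUBLAS_COMPUTE_32F,
        CUBLAS_GEMM_DEFAULT);
    (void)status;  // error handling omitted in this sketch
}
```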
Target Hardware
- NVIDIA GB10 (Blackwell architecture)
- 128GB unified CPU/GPU memory (sketch after this list)
- SM 100 compute capability
- PCIe Gen5 / NVLink support
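On unified-memory hardware like the GB10, weight buffers can be allocated once and touched from either the CPU or the GPU. The sketch below uses `cudaMallocManaged` and `cudaMemPrefetchAsync` to illustrate the idea; the device index and allocation size are arbitrary examples.

```cpp
// Sketch: managed allocations let a single GB10-class device address far
// more than a discrete GPU's VRAM. Sizes and the device index are examples.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int device = 0;
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device);
    std::printf("GPU: %s, compute %d.%d, managed memory: %d\n",
                prop.name, prop.major, prop.minor, prop.managedMemory);

    // One allocation visible to both the CPU (dequant fallback path) and GPU.
    const size_t bytes = 8ull << 30;  // e.g. 8GB of expert weights
    void* weights = nullptr;
    if (cudaMallocManaged(&weights, bytes) != cudaSuccess) return 1;

    // Optionally migrate hot pages toward the GPU ahead of the next layer.
    cudaMemPrefetchAsync(weights, bytes, device, /*stream=*/0);
    cudaDeviceSynchronize();

    cudaFree(weights);
    return 0;
}
```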
System Architecture Overview
```mermaid
flowchart TB
    subgraph Input["Input Processing"]
        A[Text Prompt] --> B[Tokenizer]
        B --> C[Token IDs]
    end
    subgraph Memory["Memory Management"]
        D[(Model Weights<br/>mmap)]
        E[(Expert Cache<br/>LRU)]
        F[(KV Cache<br/>Paged)]
    end
    subgraph Inference["Inference Pipeline"]
        C --> G[Embedding Lookup]
        G --> H[Transformer Layers]
        H --> I[MoE Routing]
        I --> J[Expert Computation]
        J --> K[Attention]
        K --> L[Output Projection]
        L --> M[LM Head]
        M --> N[Sampling]
    end
    D --> G
    D --> H
    E --> J
    F --> K
    N --> O[Generated Token]
    O -->|Loop| C
    style Input fill:#e1f5fe
    style Memory fill:#fff3e0
    style Inference fill:#e8f5e9
```
Supported Models
| Model | Parameters | Architecture | Quantization | Status |
|---|---|---|---|---|
| Qwen2.5-3B | 3B | Dense Transformer | Q4_K_M | Tested |
| Llama 4 Scout | 109B total (17B active) | MoE (16 experts) | Q4_K / Q6_K | In Progress |
| Mixtral 8x7B | 47B | MoE (8 experts) | Q4_K | Planned |
Inference Workflow
```mermaid
sequenceDiagram
participant U as User
participant T as Tokenizer
participant E as Engine
participant M as Memory Manager
participant G as GPU
U->>T: Input Prompt
T->>E: Token IDs
loop For each layer
E->>M: Request Weights
M->>M: Check mmap cache
M->>G: Dequantize Q4K/Q6K
G->>G: RMSNorm
G->>G: QKV Projection
G->>G: RoPE Encoding
G->>G: Attention (GQA)
alt MoE Layer
G->>G: Router (Top-K)
G->>M: Load Experts
M->>G: Expert FFN
else Dense Layer
G->>G: FFN (Gate+Up+Down)
end
G->>G: Residual Add
end
G->>G: LM Head
G->>G: Sampling
E->>T: Token ID
T->>U: Generated Text
```
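The MoE routing and expert-loading steps in the diagram could be implemented along these lines; `ExpertCache`, `Expert`, the `load` callback, and the top-k selection shown here are illustrative types, not the engine's actual interface.

```cpp
// Sketch: top-k expert selection followed by an LRU expert-cache lookup.
// ExpertCache, Expert, and the routing scores are illustrative.
#include <algorithm>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

struct Expert { /* dequantized FFN weights resident on the GPU */ };

class ExpertCache {
    size_t capacity_;
    std::list<int> lru_;  // most recently used expert id at the front
    std::unordered_map<int, std::pair<Expert*, std::list<int>::iterator>> map_;
public:
    explicit ExpertCache(size_t capacity) : capacity_(capacity) {}

    // Returns the cached expert, or loads + dequantizes it on a miss,
    // evicting the least recently used expert when the cache is full.
    Expert* get(int expert_id, Expert* (*load)(int)) {
        auto it = map_.find(expert_id);
        if (it != map_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second.second);  // mark as recent
            return it->second.first;
        }
        if (map_.size() == capacity_) {
            int victim = lru_.back();
            lru_.pop_back();
            delete map_[victim].first;  // free the evicted expert's buffers
            map_.erase(victim);
        }
        Expert* e = load(expert_id);    // mmap read + dequantize on demand
        lru_.push_front(expert_id);
        map_[expert_id] = {e, lru_.begin()};
        return e;
    }
};

// Top-k routing: pick the k highest router logits for the current token.
std::vector<int> top_k_experts(const std::vector<float>& logits, int k) {
    std::vector<int> idx(logits.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = static_cast<int>(i);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });
    idx.resize(k);
    return idx;
}
```

With per-layer top-k routing, only the selected experts' FFN weights need to be resident at once, which is what makes LRU caching effective for large MoE models.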
Performance Metrics
| Metric | Value | Notes |
|---|---|---|
| Throughput (Qwen2.5-3B) | ~0.1 tok/s | CPU dequant path |
| Unified memory | 128GB | GB10 target |
| Q4_K quantization | 4.5 bits/weight | 144 bytes per 256-element super-block |
| Max experts | 256 | With LRU caching |
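The 4.5 bits/weight and 144 bytes per 256-element figures correspond to the GGML-style Q4_K super-block layout. The struct below is a sketch of that layout following the common llama.cpp convention; the engine's own definition may differ.

```cpp
// Sketch of a GGML-style Q4_K super-block: 256 weights stored in 144 bytes
// (144 * 8 / 256 = 4.5 bits per weight). Field names follow the common
// llama.cpp convention; the engine's actual definition may differ.
#include <cstdint>

typedef uint16_t half;   // IEEE fp16 stored as raw bits

struct BlockQ4K {
    half    d;           // super-block scale for the quantized scales
    half    dmin;        // super-block scale for the quantized mins
    uint8_t scales[12];  // 8 sub-blocks x (6-bit scale + 6-bit min), packed
    uint8_t qs[128];     // 256 x 4-bit quantized weights, two per byte
};
static_assert(sizeof(BlockQ4K) == 144, "Q4_K super-block should be 144 bytes");
```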