HugeModel Inference Engine

Run Large Language Models Beyond GPU Memory Limits

A high-performance C++/CUDA inference engine for Mixture-of-Experts models, built around the unified memory architecture of the NVIDIA GB10.

Project Overview

Problem Statement

Modern LLMs such as Llama 4 (400B+ parameters) far exceed the memory of typical GPUs (24-80GB). Traditional solutions require expensive multi-GPU setups or aggressive model compression.

Our Solution

HugeModel leverages the unified memory of the NVIDIA GB10 (128GB shared between CPU and GPU), combined with the sparse expert activation of MoE architectures, to efficiently run models that would otherwise be impossible on single-GPU systems.

Key Innovation

Memory-mapped weights with on-demand dequantization, intelligent expert caching, and optimized cuBLAS operations enable efficient inference with a minimal memory footprint.
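
As a rough illustration of the zero-copy idea, the sketch below memory-maps a weight file read-only so that quantized tensor data is paged in by the OS only when a layer actually touches it. The path handling and `MappedWeights` struct are placeholders, not the engine's real API:

```cpp
// Minimal sketch: map a weight file read-only so quantized blocks are
// paged in on demand instead of being copied into RAM up front.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

struct MappedWeights {
    const std::uint8_t* data = nullptr;  // start of the mapped file
    std::size_t size = 0;                // total file size in bytes
};

MappedWeights map_weights(const char* path) {            // path is illustrative
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) throw std::runtime_error("open failed");
    struct stat st{};
    if (::fstat(fd, &st) != 0) { ::close(fd); throw std::runtime_error("fstat failed"); }
    void* p = ::mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);                                          // mapping stays valid after close
    if (p == MAP_FAILED) throw std::runtime_error("mmap failed");
    ::madvise(p, st.st_size, MADV_RANDOM);                // expert access is effectively random
    return {static_cast<const std::uint8_t*>(p), static_cast<std::size_t>(st.st_size)};
}
```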

Key Features

💾 Memory Efficiency

  • Memory-mapped model weights (zero-copy loading)
  • Q4_K and Q6_K quantization support
  • Dynamic expert loading with LRU caching (see the cache sketch after this list)
  • Paged KV Cache for long sequences
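
A minimal sketch of the LRU policy behind the expert cache; `ExpertKey`, the capacity unit (number of resident experts), and the FP32 weight buffers are illustrative rather than the engine's actual types:

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

// Key identifying one expert's FFN weights: (layer index, expert index).
struct ExpertKey {
    int layer, expert;
    bool operator==(const ExpertKey& o) const { return layer == o.layer && expert == o.expert; }
};
struct ExpertKeyHash {
    std::size_t operator()(const ExpertKey& k) const {
        return (static_cast<std::size_t>(k.layer) << 16) ^ static_cast<std::size_t>(k.expert);
    }
};

// LRU cache of dequantized expert weights; the least-recently-used entry is
// evicted when the capacity (number of resident experts) is exceeded.
class ExpertCache {
public:
    explicit ExpertCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns cached weights, or nullptr if the expert must be loaded and dequantized.
    std::vector<float>* get(const ExpertKey& key) {
        auto it = map_.find(key);
        if (it == map_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second.first);  // mark as most recently used
        return &it->second.second;
    }

    void put(const ExpertKey& key, std::vector<float> weights) {
        if (auto* hit = get(key)) { *hit = std::move(weights); return; }
        if (map_.size() >= capacity_) {                      // evict the LRU entry
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(key);
        map_.emplace(key, std::make_pair(lru_.begin(), std::move(weights)));
    }

private:
    std::size_t capacity_;
    std::list<ExpertKey> lru_;   // front = most recently used
    std::unordered_map<ExpertKey,
                       std::pair<std::list<ExpertKey>::iterator, std::vector<float>>,
                       ExpertKeyHash> map_;
};
```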

Performance

  • FP16 computation with FP32 accumulation
  • Optimized GEMM via cuBLAS (see the call sketch after this list)
  • RoPE positional encoding
  • Grouped Query Attention
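
The FP16-compute / FP32-accumulate pattern maps directly onto `cublasGemmEx`. The sketch below shows only the call shape (column-major layout, no transposes, error checking omitted); the engine's real wrapper will differ:

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch: C = A * B with half-precision inputs/outputs but FP32 accumulation.
void gemm_fp16_fp32_accum(cublasHandle_t handle,
                          const __half* A, const __half* B, __half* C,
                          int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;    // scalars in FP32 to match the compute type
    cublasGemmEx(handle,
                 CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,            // lda = m (column-major, A is m x k)
                 B, CUDA_R_16F, k,            // ldb = k (B is k x n)
                 &beta,
                 C, CUDA_R_16F, m,            // ldc = m (C is m x n)
                 CUBLAS_COMPUTE_32F,          // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT);
}
```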

🎯 Target Hardware

  • NVIDIA GB10 (Blackwell architecture)
  • 128GB unified CPU/GPU memory (illustrated below)
  • SM 100 compute capability
  • PCIe Gen5 / NVLink support
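
Because CPU and GPU share one physical memory pool on GB10, a single managed allocation can be touched from either side. A minimal illustration (the size and prefetch hint are arbitrary, and this is not the engine's allocator):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Allocate from the unified CPU/GPU pool; the same pointer is valid on
    // both sides, and pages migrate (or are shared) on demand.
    float* buf = nullptr;
    size_t n = size_t(1) << 28;                       // 256M floats = 1 GiB, illustrative
    if (cudaMallocManaged(&buf, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocManaged failed\n");
        return 1;
    }
    buf[0] = 1.0f;                                    // touch from the CPU
    cudaMemPrefetchAsync(buf, n * sizeof(float), 0);  // hint: make resident on device 0
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}
```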

System Architecture Overview

```mermaid
flowchart TB
    subgraph Input["Input Processing"]
        A[Text Prompt] --> B[Tokenizer]
        B --> C[Token IDs]
    end
    subgraph Memory["Memory Management"]
        D[(Model Weights<br/>mmap)]
        E[(Expert Cache<br/>LRU)]
        F[(KV Cache<br/>Paged)]
    end
    subgraph Inference["Inference Pipeline"]
        C --> G[Embedding Lookup]
        G --> H[Transformer Layers]
        H --> I[MoE Routing]
        I --> J[Expert Computation]
        J --> K[Attention]
        K --> L[Output Projection]
        L --> M[LM Head]
        M --> N[Sampling]
    end
    D --> G
    D --> H
    E --> J
    F --> K
    N --> O[Generated Token]
    O -->|Loop| C
    style Input fill:#e1f5fe
    style Memory fill:#fff3e0
    style Inference fill:#e8f5e9
```
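
The MoE Routing step in the diagram boils down to a softmax over the router logits followed by a top-k selection of experts. A self-contained sketch (function name and return shape are illustrative; `k` is assumed to be at most the number of experts):

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Sketch of MoE top-k routing: softmax over the router logits, keep the k
// highest-probability experts, renormalize their gating weights.
std::vector<std::pair<int, float>> route_top_k(const std::vector<float>& logits, int k) {
    // Numerically stable softmax.
    float mx = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - mx);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;

    // Indices of the k largest probabilities.
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return probs[a] > probs[b]; });

    // Renormalize so the selected experts' weights sum to 1.
    float topk_sum = 0.0f;
    for (int i = 0; i < k; ++i) topk_sum += probs[idx[i]];
    std::vector<std::pair<int, float>> out;
    for (int i = 0; i < k; ++i) out.emplace_back(idx[i], probs[idx[i]] / topk_sum);
    return out;
}
```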

Supported Models

| Model | Parameters | Architecture | Quantization | Status |
|---|---|---|---|---|
| Qwen2.5-3B | 3B | Dense Transformer | Q4_K_M | Tested |
| Llama 4 Scout | 17B (x128 experts) | MoE | Q4_K / Q6_K | In Progress |
| Mixtral 8x7B | 47B | MoE (8 experts) | Q4_K | Planned |

Inference Workflow

```mermaid
sequenceDiagram
    participant U as User
    participant T as Tokenizer
    participant E as Engine
    participant M as Memory Manager
    participant G as GPU
    U->>T: Input Prompt
    T->>E: Token IDs
    loop For each layer
        E->>M: Request Weights
        M->>M: Check mmap cache
        M->>G: Dequantize Q4K/Q6K
        G->>G: RMSNorm
        G->>G: QKV Projection
        G->>G: RoPE Encoding
        G->>G: Attention (GQA)
        alt MoE Layer
            G->>G: Router (Top-K)
            G->>M: Load Experts
            M->>G: Expert FFN
        else Dense Layer
            G->>G: FFN (Gate+Up+Down)
        end
        G->>G: Residual Add
    end
    G->>G: LM Head
    G->>G: Sampling
    E->>T: Token ID
    T->>U: Generated Text
```
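
The final Sampling step can be as simple as temperature-scaled sampling from the LM-head logits; a minimal sketch (greedy decoding is the limit as temperature approaches 0, and top-k/top-p filtering would slot in before the draw):

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Sketch of the sampling step: scale logits by 1/temperature, softmax,
// then draw one token id from the resulting distribution.
int sample_token(const std::vector<float>& logits, float temperature, std::mt19937& rng) {
    std::vector<float> probs(logits.size());
    float mx = *std::max_element(logits.begin(), logits.end());  // for numerical stability
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - mx) / temperature);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng);
}
```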

Performance Metrics

  • ~0.1 tok/s on Qwen2.5-3B (current CPU dequantization path)
  • 128GB unified memory on the GB10 target
  • Q4_K: 4.5 bits/weight (144 bytes per 256-element block)
  • Up to 256 experts with LRU caching
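
The 4.5 bits/weight figure follows from the Q4_K super-block layout (assuming the GGUF/llama.cpp-compatible format): 2 + 2 + 12 + 128 = 144 bytes covering 256 weights, i.e. 144 × 8 / 256 = 4.5 bits each. A sketch of the block struct:

```cpp
#include <cstdint>

// Q4_K super-block covering 256 weights, assuming the GGUF/llama.cpp layout:
//   2 bytes  d      (FP16 super-block scale)
//   2 bytes  dmin   (FP16 super-block minimum)
//  12 bytes  scales (packed 6-bit scales/mins for 8 sub-blocks of 32)
// 128 bytes  qs     (256 x 4-bit quantized values)
// = 144 bytes total  ->  144 * 8 / 256 = 4.5 bits per weight
struct BlockQ4K {
    std::uint16_t d;           // super-block scale (FP16 bit pattern)
    std::uint16_t dmin;        // super-block minimum (FP16 bit pattern)
    std::uint8_t  scales[12];  // packed 6-bit sub-block scales and mins
    std::uint8_t  qs[128];     // packed 4-bit quants
};
static_assert(sizeof(BlockQ4K) == 144, "expected 144 bytes per 256-element block");
```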