Project Structure

```
HugeModel/
├── include/                    # Header files
│   ├── common/                 # Type definitions, utilities
│   │   └── types.hpp          # u8, u16, f32, i32, etc.
│   ├── engine/                 # Core engine interfaces
│   │   ├── inference_engine.hpp
│   │   └── weight_loader.hpp
│   ├── memory/                 # Memory management
│   │   ├── memory_pool.hpp
│   │   └── paged_kv_cache.hpp
│   ├── moe/                    # Mixture-of-Experts
│   │   └── expert_cache.hpp
│   ├── kernels/                # CUDA kernel headers
│   │   ├── embedding.cuh
│   │   ├── attention.cuh
│   │   ├── mlp.cuh
│   │   └── sampling.cuh
│   └── tokenizer/              # Tokenization
│       └── tokenizer.hpp
├── src/                        # Implementation
│   ├── engine/                 # Main inference logic (~4000 lines)
│   ├── kernels/                # CUDA kernels
│   ├── memory/                 # Memory managers
│   └── tokenizer/              # Tokenizer implementation
├── lib/
│   └── layer_prefetch/         # Async GPU dequantization library
└── tools/                      # Debug and analysis scripts
```

Core Components

```mermaid
classDiagram
    class InferenceEngine {
        -WeightLoader weightLoader
        -MemoryManager memManager
        -ExpertCache expertCache
        -PagedKVCache kvCache
        -CublasManager cublas
        +initialize() bool
        +generate(prompt, maxTokens) string
        +runPrefill(tokens) void
        +runDecode() i32
    }
    class WeightLoader {
        -void* mmapData
        -ModelConfig config
        +open(path) bool
        +getTensor(name) TensorView
        +modelConfig() ModelConfig
    }
    class MemoryManager {
        -GPUPool gpuPool
        -size_t allocated
        +allocateGPU(size) GPUPtr
        +allocateUnified(size) UnifiedPtr
        +freeGPU(ptr) void
    }
    class ExpertCache {
        -LRUCache cache
        -size_t maxExperts
        +getExpert(layer, id) Expert
        +evictLRU() void
        +stats() CacheStats
    }
    class PagedKVCache {
        -vector~Page~ pages
        -size_t pageSize
        +appendKV(layer, k, v) void
        +getKV(layer, range) KVSlice
    }
    InferenceEngine --> WeightLoader
    InferenceEngine --> MemoryManager
    InferenceEngine --> ExpertCache
    InferenceEngine --> PagedKVCache
```
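
A hypothetical driver using the public surface shown above. initialize() and generate(prompt, maxTokens) come from the class diagram; the constructor signature, model path, and include path (assuming include/ is on the compiler's include path) are assumptions, not the engine's confirmed API.

```cpp
#include <cstdio>
#include <string>
#include "engine/inference_engine.hpp"

int main() {
    // Constructor argument is illustrative; the diagram only shows that
    // WeightLoader::open(path) is called somewhere during initialization.
    InferenceEngine engine("models/qwen2.5-3b.hm");
    if (!engine.initialize()) {
        std::fprintf(stderr, "failed to initialize engine\n");
        return 1;
    }
    // generate() runs runPrefill() over the prompt tokens, then repeated
    // runDecode() steps, one generated token per step.
    std::string out = engine.generate("Explain paged KV caches.", /*maxTokens=*/128);
    std::printf("%s\n", out.c_str());
    return 0;
}
```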

Memory Architecture

```mermaid
flowchart LR
    subgraph Host["Host Memory"]
        A["Model File<br/>.hm format"]
        B["mmap Region<br/>Zero-copy"]
    end
    subgraph Unified["Unified Memory (128GB)"]
        C[Hidden States]
        D[QKV Buffers]
        E[Attention Output]
        F[MLP Intermediates]
    end
    subgraph GPU["GPU Memory"]
        G[Dequantized Weights]
        H[KV Cache]
        I[Expert Cache]
        J[Logits Buffer]
    end
    A -->|mmap| B
    B -->|PCIe/NVLink| G
    C <-->|Unified| D
    D <-->|Unified| E
    G --> H
    G --> I
    E --> J
    style Host fill:#ffecb3
    style Unified fill:#c8e6c9
    style GPU fill:#bbdefb
```
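
A minimal sketch of how the three tiers could be set up: the quantized .hm file is mmap'd read-only on the host (zero-copy), activations that both CPU and GPU touch go through cudaMallocManaged, and device-only buffers (dequantized weights, KV cache, expert cache) use cudaMalloc. Function names and structure are illustrative, not the actual MemoryManager API.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Host tier: read-only mapping of the quantized model file.
void* mapModelFile(const char* path, size_t* sizeOut) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    fstat(fd, &st);
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                           // mapping stays valid after close
    *sizeOut = static_cast<size_t>(st.st_size);
    return base == MAP_FAILED ? nullptr : base;
}

void allocateBuffers(size_t actBytes, size_t weightBytes) {
    // Unified tier: buffers the CPU and GPU both touch (hidden states, QKV).
    void* hiddenStates = nullptr;
    cudaMallocManaged(&hiddenStates, actBytes);

    // GPU tier: dequantized weights, KV cache, expert cache, logits.
    void* deqWeights = nullptr;
    cudaMalloc(&deqWeights, weightBytes);

    cudaFree(deqWeights);
    cudaFree(hiddenStates);              // cudaFree also releases managed memory
}
```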

Memory Layout Details

| Component | Location | Size (Qwen2.5-3B) | Purpose |
|-----------|----------|-------------------|---------|
| Model Weights | mmap | ~2 GB (Q4_K) | Quantized parameters |
| Hidden States | Unified | seq_len * 2048 * 2 B | Layer activations |
| KV Cache | GPU/Unified | layers * seq * kv_dim * 2 B | Attention cache |
| Expert Cache | GPU | Configurable (LRU) | Hot experts |
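
To make the formulas concrete, the sketch below plugs illustrative values into the two size expressions from the table. Only hidden = 2048 comes from the table; the sequence length, layer count, and kv_dim are placeholders, not confirmed Qwen2.5-3B values.

```cpp
#include <cstddef>
#include <cstdio>

int main() {
    const size_t seq_len = 4096;   // tokens in context (illustrative)
    const size_t hidden  = 2048;   // hidden size (from the table)
    const size_t layers  = 36;     // transformer layers (placeholder)
    const size_t kv_dim  = 512;    // K/V width per token per layer (placeholder)

    const size_t hidden_states = seq_len * hidden * 2;          // bytes, FP16
    const size_t kv_cache      = layers * seq_len * kv_dim * 2; // bytes, FP16

    std::printf("hidden states: %zu MiB\n", hidden_states >> 20);  // 16 MiB
    std::printf("kv cache:      %zu MiB\n", kv_cache >> 20);       // 144 MiB
    return 0;
}
```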

Transformer Layer Flow

```mermaid
flowchart TB
    subgraph Layer["Transformer Layer"]
        A[Input Hidden States] --> B[RMSNorm]
        B --> C[QKV Projection]
        subgraph Attention["Multi-Head Attention"]
            C --> D[Q Matrix]
            C --> E[K Matrix]
            C --> F[V Matrix]
            D --> G[RoPE Q]
            E --> H[RoPE K]
            H --> I[KV Cache Append]
            G --> J[Attention Scores]
            I --> J
            J --> K[Softmax]
            K --> L["Attention @ V"]
        end
        L --> M[Output Projection]
        M --> N[Residual Add]
        N --> O[RMSNorm]
        subgraph FFN["Feed-Forward / MoE"]
            O --> P{MoE Layer?}
            P -->|Yes| Q[Router]
            Q --> R[Top-K Experts]
            R --> S[Expert FFN]
            P -->|No| T[Dense FFN]
            S --> U[Combine]
            T --> U
        end
        U --> V[Residual Add]
        V --> W[Output Hidden States]
    end
    style Attention fill:#e3f2fd
    style FFN fill:#fce4ec
```
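
The FFN branch of the diagram (dense or per-expert) computes down(SiLU(gate(x)) * up(x)), assuming the gated-SiLU form that the expert-computation diagram below also shows. Here is a plain CPU reference of that branch; names and matrix layouts are illustrative, and the real path runs as cuBLAS GEMMs plus the fused kernels listed under CUDA Kernel Organization.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive row-major matrix-vector product: y = W * x, W is rows x cols.
static std::vector<float> matvec(const std::vector<float>& W,
                                 const std::vector<float>& x,
                                 size_t rows, size_t cols) {
    std::vector<float> y(rows, 0.0f);
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            y[r] += W[r * cols + c] * x[c];
    return y;
}

// Reference FFN: out = W_down * ( SiLU(W_gate * x) * (W_up * x) ), elementwise.
std::vector<float> denseFFN(const std::vector<float>& x,
                            const std::vector<float>& Wgate,
                            const std::vector<float>& Wup,
                            const std::vector<float>& Wdown,
                            size_t hidden, size_t inter) {
    std::vector<float> g = matvec(Wgate, x, inter, hidden);
    std::vector<float> u = matvec(Wup,   x, inter, hidden);
    for (size_t i = 0; i < inter; ++i) {
        float s = g[i] / (1.0f + std::exp(-g[i]));   // SiLU activation
        g[i] = s * u[i];                             // gate * up
    }
    return matvec(Wdown, g, hidden, inter);          // back to hidden size
}
```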

Quantization Formats

Q4_K Format

~4.5 bits per weight

| Field | Size | Contents |
|-------|------|----------|
| d | FP16 (2 B) | super-block scale |
| dmin | FP16 (2 B) | super-block minimum scale |
| scales | 12 B | packed 6-bit sub-block scales and mins |
| qs | 128 B | 4-bit quants, two per byte |

Total: 144 bytes / 256 elements

value = d * scale * q4 - dmin * min

where q4 is a 4-bit quant from qs, and scale/min are the 6-bit sub-block scale and minimum unpacked from scales.
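
A sketch of the block layout and formula in code. The struct follows the field order and sizes listed above; how the 6-bit sub-block scales are packed inside scales[12] is not spelled out here, so the helper takes already-unpacked values.

```cpp
#include <cstdint>
#include <cuda_fp16.h>

// Block layout inferred from the field list above (256 weights per block).
struct BlockQ4K {
    __half  d;           // super-block scale                   (2 B)
    __half  dmin;        // super-block minimum scale           (2 B)
    uint8_t scales[12];  // packed 6-bit sub-block scales/mins  (12 B)
    uint8_t qs[128];     // 256 x 4-bit quants, two per byte    (128 B)
};
static_assert(sizeof(BlockQ4K) == 144, "144 bytes per 256 weights");

// value = d * scale * q4 - dmin * min, with scale6/min6 already unpacked
// from scales[] for the sub-block that q4 belongs to.
inline float dequantQ4KValue(__half d, __half dmin,
                             uint8_t scale6, uint8_t min6, uint8_t q4) {
    return __half2float(d) * scale6 * (q4 & 0xF) - __half2float(dmin) * min6;
}
```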

Q6_K Format

~6.5 bits per weight

| Field | Size | Contents |
|-------|------|----------|
| ql | 128 B | low 4 bits of each quant, two per byte |
| qh | 64 B | high 2 bits of each quant, four per byte |
| scales | 16 B (i8) | per-sub-block scales |
| d | FP16 (2 B) | super-block scale |

Total: 210 bytes / 256 elements

value = d * scale * (((ql & 0xF) | ((qh & 0x3) << 4)) - 32)

where ql supplies the low 4 bits and qh the high 2 bits of each quant, giving a 6-bit value that is re-centered by subtracting 32.
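
The same treatment for Q6_K: the struct mirrors the field order above, and the helper reconstructs one signed 6-bit quant from its low and high bit fields. Sub-block indexing into ql/qh/scales is omitted, so treat this as a sketch of the formula rather than a full dequantizer.

```cpp
#include <cstdint>
#include <cuda_fp16.h>

// Block layout inferred from the field list above (256 weights per block).
struct BlockQ6K {
    uint8_t ql[128];     // low 4 bits of each quant, two per byte
    uint8_t qh[64];      // high 2 bits of each quant, four per byte
    int8_t  scales[16];  // per-sub-block scales (i8)
    __half  d;           // super-block scale (FP16)
};
static_assert(sizeof(BlockQ6K) == 210, "210 bytes per 256 weights");

// value = d * scale * (((ql & 0xF) | ((qh & 0x3) << 4)) - 32)
inline float dequantQ6KValue(__half d, int8_t scale, uint8_t ql4, uint8_t qh2) {
    const int q = ((ql4 & 0xF) | ((qh2 & 0x3) << 4)) - 32;  // 6-bit value, centered
    return __half2float(d) * scale * q;
}
```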

Expert Cache Architecture

```mermaid
flowchart TB
    subgraph Router["MoE Router"]
        A[Hidden State] --> B[Gate Projection]
        B --> C[Softmax/Sigmoid]
        C --> D[Top-K Selection]
    end
    subgraph Cache["Expert Cache (LRU)"]
        D --> E{Expert in Cache?}
        E -->|Yes| F[Cache Hit]
        E -->|No| G[Cache Miss]
        G --> H[Load from mmap]
        H --> I[Dequantize GPU]
        I --> J[Insert to Cache]
        J --> K{Cache Full?}
        K -->|Yes| L[Evict LRU]
        F --> M[Expert Weights]
        J --> M
    end
    subgraph Compute["Expert Computation"]
        M --> N[Gate * Up]
        N --> O[SiLU Activation]
        O --> P[Down Projection]
    end
    P --> Q[Weighted Sum]
    Q --> R[Output]
    style Router fill:#fff3e0
    style Cache fill:#e8f5e9
    style Compute fill:#e3f2fd
```
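
A condensed sketch of the LRU policy in the cache box above, keyed by (layer, expert id). The real ExpertCache also loads the expert block from the mmap'd file and dequantizes it on the GPU on a miss; loadAndDequantize() stands in for that path, and all names here are illustrative.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>

struct Expert { void* gpuWeights = nullptr; };   // placeholder payload

class ExpertCacheSketch {
public:
    explicit ExpertCacheSketch(size_t maxExperts) : maxExperts_(maxExperts) {}

    Expert& getExpert(int layer, int id) {
        const uint64_t key = (static_cast<uint64_t>(layer) << 32) |
                             static_cast<uint32_t>(id);
        auto it = map_.find(key);
        if (it != map_.end()) {                          // cache hit
            lru_.splice(lru_.begin(), lru_, it->second); // move to MRU position
            return it->second->second;
        }
        if (lru_.size() >= maxExperts_) {                // cache full: evict LRU
            map_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(key, loadAndDequantize(layer, id)); // cache miss
        map_[key] = lru_.begin();
        return lru_.front().second;
    }

private:
    // Stand-in for: read expert from the mmap'd file, dequantize on the GPU.
    Expert loadAndDequantize(int /*layer*/, int /*id*/) { return Expert{}; }

    size_t maxExperts_;
    std::list<std::pair<uint64_t, Expert>> lru_;         // front = most recent
    std::unordered_map<uint64_t,
        std::list<std::pair<uint64_t, Expert>>::iterator> map_;
};
```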

CUDA Kernel Organization

| Kernel File | Functions | Description |
|-------------|-----------|-------------|
| embedding.cu | embeddingLookup, applyRoPEInline, precomputeRoPEFreqs | Token embedding and positional encoding |
| attention.cu | ropeKernel, softmaxKernel, attentionKernel | Attention mechanism components |
| mlp.cu | siluKernel, gemmFP16_FP32Accum, expertMLP | Feed-forward and MoE computation |
| rms_norm.cu | rmsNormKernel | RMS normalization |
| sampling.cu | greedyKernel, softmaxKernel, multinomialSample | Token sampling strategies |
| dequant_kernels.cu | dequantQ4K, dequantQ6K | GPU-accelerated dequantization |
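
As an example of the kernel style, here is a minimal RMSNorm kernel of the kind rms_norm.cu would contain: one block per row, shared-memory reduction for the sum of squares. This is a sketch, not the project's actual rmsNormKernel, which likely uses fp16 I/O and vectorized loads.

```cuda
#include <cuda_runtime.h>

// y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i], one block per row.
// Assumes blockDim.x is a power of two.
__global__ void rmsNormKernelSketch(const float* __restrict__ x,
                                    const float* __restrict__ weight,
                                    float* __restrict__ y,
                                    int dim, float eps) {
    extern __shared__ float partial[];                 // one slot per thread
    const float* row = x + (size_t)blockIdx.x * dim;
    float* out = y + (size_t)blockIdx.x * dim;

    float sum = 0.0f;
    for (int i = threadIdx.x; i < dim; i += blockDim.x)
        sum += row[i] * row[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction over the block to get the total sum of squares.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    const float inv_rms = rsqrtf(partial[0] / dim + eps);

    for (int i = threadIdx.x; i < dim; i += blockDim.x)
        out[i] = row[i] * inv_rms * weight[i];
}

// Launch: one block per token/row, e.g.
// rmsNormKernelSketch<<<numTokens, 256, 256 * sizeof(float)>>>(x, w, y, dim, 1e-6f);
```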

Data Flow Diagram

```mermaid
flowchart LR
    subgraph Disk["Storage"]
        A[(".hm Model File")]
    end
    subgraph CPU["CPU Domain"]
        B[mmap Mapping]
        C[Tokenizer]
        D[Weight Loader]
    end
    subgraph Transfer["PCIe/NVLink"]
        E[DMA Transfer]
    end
    subgraph GPU["GPU Domain"]
        F[Dequant Kernels]
        G[GEMM cuBLAS]
        H[Attention]
        I[MLP/MoE]
        J[Sampling]
    end
    A --> B
    B --> D
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I
    I --> G
    G --> J
    C --> D
    J --> C
    style Disk fill:#ffcc80
    style CPU fill:#b3e5fc
    style Transfer fill:#c5e1a5
    style GPU fill:#f8bbd9
```
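
The per-token loop implied by the diagram, written against assumed interfaces: encode(), decode(), and eosId() on the tokenizer side are illustrative names, while runPrefill() and runDecode() come from the class diagram. Templates are used only so the sketch stands alone without the real headers.

```cpp
#include <cstdint>
#include <string>
#include <vector>

template <typename Tokenizer, typename Engine>
std::string generateLoop(Tokenizer& tok, Engine& eng,
                         const std::string& prompt, int maxTokens) {
    std::vector<int32_t> ids = tok.encode(prompt);   // CPU: tokenize the prompt
    eng.runPrefill(ids);                             // GPU: fill the KV cache
    std::string out;
    for (int step = 0; step < maxTokens; ++step) {
        int32_t next = eng.runDecode();              // GPU: dequant -> GEMM -> attention -> MLP -> sample
        if (next == tok.eosId()) break;              // stop at end-of-sequence
        out += tok.decode(next);                     // CPU: detokenize the sampled id
        // the sampled id becomes the input to the next decode step
    }
    return out;
}
```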