System Architecture
Detailed overview of HugeModel's internal structure and data flow
Project Structure
HugeModel/
├── include/                  # Header files
│   ├── common/               # Type definitions, utilities
│   │   └── types.hpp         # u8, u16, f32, i32, etc.
│   ├── engine/               # Core engine interfaces
│   │   ├── inference_engine.hpp
│   │   └── weight_loader.hpp
│   ├── memory/               # Memory management
│   │   ├── memory_pool.hpp
│   │   └── paged_kv_cache.hpp
│   ├── moe/                  # Mixture-of-Experts
│   │   └── expert_cache.hpp
│   ├── kernels/              # CUDA kernel headers
│   │   ├── embedding.cuh
│   │   ├── attention.cuh
│   │   ├── mlp.cuh
│   │   └── sampling.cuh
│   └── tokenizer/            # Tokenization
│       └── tokenizer.hpp
├── src/                      # Implementation
│   ├── engine/               # Main inference logic (~4000 lines)
│   ├── kernels/              # CUDA kernels
│   ├── memory/               # Memory managers
│   └── tokenizer/            # Tokenizer implementation
├── lib/
│   └── layer_prefetch/       # Async GPU dequantization library
└── tools/                    # Debug and analysis scripts
Core Components
classDiagram
class InferenceEngine {
-WeightLoader weightLoader
-MemoryManager memManager
-ExpertCache expertCache
-PagedKVCache kvCache
-CublasManager cublas
+initialize() bool
+generate(prompt, maxTokens) string
+runPrefill(tokens) void
+runDecode() i32
}
class WeightLoader {
-void* mmapData
-ModelConfig config
+open(path) bool
+getTensor(name) TensorView
+modelConfig() ModelConfig
}
class MemoryManager {
-GPUPool gpuPool
-size_t allocated
+allocateGPU(size) GPUPtr
+allocateUnified(size) UnifiedPtr
+freeGPU(ptr) void
}
class ExpertCache {
-LRUCache cache
-size_t maxExperts
+getExpert(layer, id) Expert
+evictLRU() void
+stats() CacheStats
}
class PagedKVCache {
-vector~Page~ pages
-size_t pageSize
+appendKV(layer, k, v) void
+getKV(layer, range) KVSlice
}
InferenceEngine --> WeightLoader
InferenceEngine --> MemoryManager
InferenceEngine --> ExpertCache
InferenceEngine --> PagedKVCache
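A minimal host-side usage sketch, assuming only the interfaces shown in the class diagram; how the engine is constructed and configured is not documented here and is left as an assumption.

```cpp
#include "engine/inference_engine.hpp"   // from include/engine/
#include <iostream>
#include <string>

int main() {
    // Construction/configuration details are assumptions; only initialize()
    // and generate() are taken from the class diagram above.
    InferenceEngine engine;
    if (!engine.initialize()) {
        std::cerr << "engine initialization failed\n";
        return 1;
    }
    // generate() presumably runs the prefill pass over the prompt tokens,
    // then calls runDecode() in a loop until maxTokens tokens are sampled.
    std::string out = engine.generate("Explain unified memory in one sentence.", 64);
    std::cout << out << std::endl;
    return 0;
}
```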
Memory Architecture
flowchart LR
subgraph Host["Host Memory"]
A["Model File (.hm format)"]
B["mmap Region (zero-copy)"]
end
subgraph Unified["Unified Memory (128GB)"]
C[Hidden States]
D[QKV Buffers]
E[Attention Output]
F[MLP Intermediates]
end
subgraph GPU["GPU Memory"]
G[Dequantized Weights]
H[KV Cache]
I[Expert Cache]
J[Logits Buffer]
end
A -->|mmap| B
B -->|PCIe/NVLink| G
C <-->|Unified| D
D <-->|Unified| E
G --> H
G --> I
E --> J
style Host fill:#ffecb3
style Unified fill:#c8e6c9
style GPU fill:#bbdefb
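The host-to-GPU path in this diagram rests on two primitives: an mmap of the model file (zero-copy on the host side) and CUDA unified memory for activation buffers. The snippet below is a minimal sketch of those two allocations, not the project's actual MemoryManager code; the file path and buffer size are placeholders.

```cpp
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    // 1. Zero-copy view of the model file: pages are faulted in on demand,
    //    so opening a multi-GB .hm file is effectively instantaneous.
    int fd = open("model.hm", O_RDONLY);        // placeholder path
    struct stat st{};
    fstat(fd, &st);
    void* weights = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    // 2. Unified memory for activations: accessible from both CPU and GPU,
    //    migrated by the driver on first touch.
    void* hiddenStates = nullptr;
    cudaMallocManaged(&hiddenStates, 4096 * 2048 * sizeof(__half)); // placeholder size

    // ... inference would run here ...

    cudaFree(hiddenStates);
    munmap(weights, st.st_size);
    close(fd);
    return 0;
}
```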
Memory Layout Details
| Component | Location | Size (Qwen2.5-3B) | Purpose |
|---|---|---|---|
| Model Weights | mmap | ~2GB (Q4_K) | Quantized parameters |
| Hidden States | Unified | seq_len * 2048 * 2B | Layer activations |
| KV Cache | GPU/Unified | layers * seq * kv_dim * 2B | Attention cache |
| Expert Cache | GPU | Configurable (LRU) | Hot experts |
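Plugging the table's formulas into a small helper gives a feel for the magnitudes involved. The layer count, hidden size, and KV dimension below are assumptions based on Qwen2.5-3B's published configuration, not values read from the engine.

```cpp
#include <cstddef>
#include <cstdio>

int main() {
    // Assumed Qwen2.5-3B config: 36 layers, hidden size 2048,
    // 2 KV heads x head_dim 128 (K and V folded into kvDim).
    const std::size_t layers    = 36;
    const std::size_t hidden    = 2048;
    const std::size_t kvDim     = 2 * 2 * 128;   // (K+V) * num_kv_heads * head_dim
    const std::size_t elemBytes = 2;             // FP16
    const std::size_t seqLen    = 4096;

    std::size_t hiddenStates = seqLen * hidden * elemBytes;          // seq_len * 2048 * 2B
    std::size_t kvCache      = layers * seqLen * kvDim * elemBytes;  // layers * seq * kv_dim * 2B

    std::printf("hidden states: %.1f MiB\n", hiddenStates / (1024.0 * 1024.0)); // ~16 MiB
    std::printf("kv cache:      %.1f MiB\n", kvCache / (1024.0 * 1024.0));      // ~144 MiB
    return 0;
}
```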
Transformer Layer Flow
flowchart TB
subgraph Layer["Transformer Layer"]
A[Input Hidden States] --> B[RMSNorm]
B --> C[QKV Projection]
subgraph Attention["Multi-Head Attention"]
C --> D[Q Matrix]
C --> E[K Matrix]
C --> F[V Matrix]
D --> G[RoPE Q]
E --> H[RoPE K]
H --> I[KV Cache Append]
G --> J[Attention Scores]
I --> J
J --> K[Softmax]
K --> L[Attention @ V]
end
L --> M[Output Projection]
M --> N[Residual Add]
N --> O[RMSNorm]
subgraph FFN["Feed-Forward / MoE"]
O --> P{MoE Layer?}
P -->|Yes| Q[Router]
Q --> R[Top-K Experts]
R --> S[Expert FFN]
P -->|No| T[Dense FFN]
S --> U[Combine]
T --> U
end
U --> V[Residual Add]
V --> W[Output Hidden States]
end
style Attention fill:#e3f2fd
style FFN fill:#fce4ec
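The RoPE steps in the diagram rotate each query/key pair by a position-dependent angle before attention scores are computed. Below is a minimal CPU reference of that rotation, assuming the common base of 10000 and the interleaved pairing convention (implementations differ); the engine's own precomputeRoPEFreqs/applyRoPEInline run on the GPU and may order dimensions differently.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for rotary position embedding on one head.
// Consecutive pairs (x[2i], x[2i+1]) are rotated by an angle that depends on
// the token position and the pair index. Base 10000 is an assumption.
void applyRoPE(std::vector<float>& x, std::size_t pos, float base = 10000.0f) {
    const std::size_t headDim = x.size();
    for (std::size_t i = 0; i + 1 < headDim; i += 2) {
        float theta = pos * std::pow(base, -static_cast<float>(i) / headDim);
        float c = std::cos(theta), s = std::sin(theta);
        float x0 = x[i], x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}
```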
Quantization Formats
Q4_K Format
~4.5 bits per weight
| Field | Size |
|---|---|
| d | FP16 (2 bytes) |
| dmin | FP16 (2 bytes) |
| scales | 12 bytes |
| qs | 128 bytes |

Total: 144 bytes per 256 elements

Dequantization: value = d * scale * (nibble & 0xF) - dmin * min
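A simplified reference of that formula applied to one 256-element block: the 128 qs bytes hold two 4-bit values each, and the block is treated as 8 sub-blocks of 32 with one (scale, min) pair per sub-block. The 6-bit packing of those pairs inside the 12-byte scales field is elided, so the function takes them already unpacked; this is a sketch of the math, not the project's dequantQ4K kernel.

```cpp
#include <cstdint>

// Reference dequantization of one Q4_K block (256 values, 144 bytes on disk).
// d and dmin are the block's FP16 super-scales, passed here already converted
// to float; subScales/subMins are the eight per-sub-block (scale, min) values
// whose unpacking from the 12-byte `scales` field is omitted for brevity.
void dequantQ4KBlock(const uint8_t qs[128],       // packed 4-bit quants
                     float d, float dmin,
                     const uint8_t subScales[8],
                     const uint8_t subMins[8],
                     float out[256]) {
    for (int sub = 0; sub < 8; ++sub) {           // 8 sub-blocks of 32 values
        float scale = d * subScales[sub];
        float min   = dmin * subMins[sub];
        for (int j = 0; j < 32; ++j) {
            int idx = sub * 32 + j;
            // Nibble ordering within qs is an assumption; the on-disk order
            // used by the engine may differ.
            uint8_t byte = qs[idx / 2];
            uint8_t nib  = (idx % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
            out[idx] = scale * nib - min;          // value = d*scale*q - dmin*min
        }
    }
}
```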
Q6_K Format
~6.5 bits per weight
| Field | Size |
|---|---|
| ql | 128 bytes |
| qh | 64 bytes |
| scales | 16 bytes (i8) |
| d | FP16 (2 bytes) |

Total: 210 bytes per 256 elements

Dequantization: value = d * scale * ((ql | (qh << 4)) - 32)
Expert Cache Architecture
flowchart TB
subgraph Router["MoE Router"]
A[Hidden State] --> B[Gate Projection]
B --> C[Softmax/Sigmoid]
C --> D[Top-K Selection]
end
subgraph Cache["Expert Cache (LRU)"]
D --> E{Expert in Cache?}
E -->|Yes| F[Cache Hit]
E -->|No| G[Cache Miss]
G --> H[Load from mmap]
H --> I[Dequantize GPU]
I --> J[Insert to Cache]
J --> K{Cache Full?}
K -->|Yes| L[Evict LRU]
F --> M[Expert Weights]
J --> M
end
subgraph Compute["Expert Computation"]
M --> N[Gate * Up]
N --> O[SiLU Activation]
O --> P[Down Projection]
end
P --> Q[Weighted Sum]
Q --> R[Output]
style Router fill:#fff3e0
style Cache fill:#e8f5e9
style Compute fill:#e3f2fd
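The middle subgraph is a standard LRU cache keyed by (layer, expert id). The sketch below shows that hit/miss/evict pattern with a std::list plus hash map; the Expert payload, the loader callback, and the capacity handling are placeholders rather than the real ExpertCache interface.

```cpp
#include <cstdint>
#include <functional>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

// Placeholder payload: in the real engine this would be dequantized
// gate/up/down weights resident in GPU memory.
struct Expert { std::vector<float> weights; };

class LruExpertCache {
public:
    using Key = uint64_t;                      // (layer << 32) | expertId
    using Loader = std::function<Expert(int layer, int id)>;

    LruExpertCache(std::size_t maxExperts, Loader loader)
        : maxExperts_(maxExperts), loader_(std::move(loader)) {}

    // Returns the expert; on a miss, loads it (mmap read + dequantization in
    // the real engine) and evicts the least recently used entry if full.
    const Expert& getExpert(int layer, int id) {
        Key key = (static_cast<Key>(layer) << 32) | static_cast<uint32_t>(id);
        auto it = index_.find(key);
        if (it != index_.end()) {              // cache hit: move to front
            lru_.splice(lru_.begin(), lru_, it->second);
            return it->second->second;
        }
        if (lru_.size() >= maxExperts_) {      // cache full: evict LRU tail
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(key, loader_(layer, id));
        index_[key] = lru_.begin();
        return lru_.front().second;
    }

private:
    std::size_t maxExperts_;
    Loader loader_;
    std::list<std::pair<Key, Expert>> lru_;    // front = most recently used
    std::unordered_map<Key, std::list<std::pair<Key, Expert>>::iterator> index_;
};
```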
CUDA Kernel Organization
| Kernel File | Functions | Description |
|---|---|---|
| embedding.cu | embeddingLookup, applyRoPEInline, precomputeRoPEFreqs | Token embedding and positional encoding |
| attention.cu | ropeKernel, softmaxKernel, attentionKernel | Attention mechanism components |
| mlp.cu | siluKernel, gemmFP16_FP32Accum, expertMLP | Feed-forward and MoE computation |
| rms_norm.cu | rmsNormKernel | RMS normalization |
| sampling.cu | greedyKernel, softmaxKernel, multinomialSample | Token sampling strategies |
| dequant_kernels.cu | dequantQ4K, dequantQ6K | GPU-accelerated dequantization |
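To make the table concrete, here is a minimal sketch of an element-wise SiLU kernel of the kind listed for mlp.cu; the actual siluKernel may use a different signature or fuse the activation with the gate/up projection.

```cuda
#include <cuda_runtime.h>

// Minimal sketch of an element-wise SiLU kernel: y = x * sigmoid(x).
// The signature is an assumption, not the one in src/kernels/mlp.cu.
__global__ void siluKernelSketch(const float* __restrict__ in,
                                 float* __restrict__ out,
                                 int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        out[i] = x / (1.0f + expf(-x));   // x * sigmoid(x)
    }
}

// Host-side launch: one thread per element.
void launchSilu(const float* dIn, float* dOut, int n, cudaStream_t stream) {
    int block = 256;
    int grid = (n + block - 1) / block;
    siluKernelSketch<<<grid, block, 0, stream>>>(dIn, dOut, n);
}
```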
Data Flow Diagram
flowchart LR
subgraph Disk["Storage"]
A[(.hm Model File)]
end
subgraph CPU["CPU Domain"]
B[mmap Mapping]
C[Tokenizer]
D[Weight Loader]
end
subgraph Transfer["PCIe/NVLink"]
E[DMA Transfer]
end
subgraph GPU["GPU Domain"]
F[Dequant Kernels]
G[GEMM cuBLAS]
H[Attention]
I[MLP/MoE]
J[Sampling]
end
A --> B
B --> D
D --> E
E --> F
F --> G
G --> H
H --> I
I --> G
G --> J
C --> D
J --> C
style Disk fill:#ffcc80
style CPU fill:#b3e5fc
style Transfer fill:#c5e1a5
style GPU fill:#f8bbd9