Files
wifi-densepose/examples/exo-ai-2025/research/05-memory-mapped-neural-fields/architecture.md
ruv d803bfe2b1 Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00

835 lines
24 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# System Architecture: Demand-Paged Neural Cognition
## Table of Contents
1. [Overview](#overview)
2. [Component Architecture](#component-architecture)
3. [Data Structures](#data-structures)
4. [Algorithms](#algorithms)
5. [Performance Model](#performance-model)
6. [Implementation Plan](#implementation-plan)
---
## Overview
### System Diagram
```
┌───────────────────────────────────────────────────────────────────┐
│ DPNC Agent │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Inference Engine (hot path) │ │
│ │ - Query processing │ │
│ │ - SIMD-accelerated inference │ │
│ │ - Context assembly │ │
│ └────────────┬────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────▼────────────────────────────────────────────────┐ │
│ │ Memory Manager │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ L1 DRAM │ │ L2 CXL │ │ L3 SSD │ │ L4 HDD │ │ │
│ │ │ 64 GB │◄─┤ 512 GB │◄─┤ 4 TB │◄─┤ 1 PB │ │ │
│ │ │ 80ns │ │ 350ns │ │ 80μs │ │ 10ms │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ ▲ ▲ ▲ ▲ │ │
│ │ └─────────────┴─────────────┴─────────────┘ │ │
│ │ Tier Migration Policy │ │
│ └────────────┬────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────▼────────────────────────────────────────────────┐ │
│ │ Prefetch Predictor (Hoeffding Tree) │ │
│ │ - Streaming ML model (0.3 MB) │ │
│ │ - 97.6% accuracy │ │
│ │ - Async prefetch queue │ │
│ └────────────┬────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────▼────────────────────────────────────────────────┐ │
│ │ Neural Field Storage │ │
│ │ - Memory-mapped files (mmap) │ │
│ │ - Multi-resolution hash encoding │ │
│ │ - Sparse distributed addressing │ │
│ │ - Lazy evaluation │ │
│ └─────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
│ I/O
┌─────────────────────────────┐
│ Persistent Storage │
│ - NVMe SSD array (10×) │
│ - HDD archive │
│ - Object storage (S3) │
└─────────────────────────────┘
```
---
## Component Architecture
### 1. Inference Engine
**Responsibilities**:
- Process queries from user/application
- Assemble context from multi-tier memory
- Execute neural network inference
- Return results
**Interfaces**:
```rust
pub trait InferenceEngine {
fn query(&mut self, input: &[f32]) -> Result<Vec<f32>>;
fn context_size(&self) -> usize;
fn active_memory(&self) -> usize;
}
```
**Implementation Strategy**:
- **Hot Path Optimization**: Keep inference loop in L1 cache
- **SIMD Kernels**: AVX-512 for matmul, dot products
- **Zero-Copy**: Work directly on mmap'd data
- **Async I/O**: Non-blocking prefetch requests
---
### 2. Memory Manager
**Responsibilities**:
- Manage 4-tier hierarchy (DRAM, CXL, SSD, HDD)
- Page in/out based on access patterns
- Handle page faults (cold misses)
- Coordinate with prefetcher
**Interfaces**:
```rust
pub trait MemoryManager {
fn load_page(&mut self, addr: u64) -> Result<&[f32]>;
fn evict_page(&mut self, addr: u64) -> Result<()>;
fn promote(&mut self, addr: u64, target_tier: Tier) -> Result<()>;
fn demote(&mut self, addr: u64, target_tier: Tier) -> Result<()>;
}
```
**Tier Migration Policy**:
```rust
enum MigrationPolicy {
// Promote to faster tier
Promote {
trigger: PromoteTrigger,
target: Tier,
},
// Demote to slower tier
Demote {
trigger: DemoteTrigger,
target: Tier,
},
}
enum PromoteTrigger {
PredictedAccess(f32), // Prefetcher confidence
RecentAccess(Duration), // Accessed within duration
HighImportance(f32), // Semantic importance score
}
enum DemoteTrigger {
LRU(Duration), // Not accessed in duration
CapacityPressure(f32), // Tier usage > threshold
LowImportance(f32), // Semantic importance < threshold
}
```
**Page Replacement Algorithm**:
```rust
fn evict_candidate(tier: Tier) -> PageId {
// Weighted LRU + semantic importance
let mut candidates = tier.pages()
.filter(|p| !p.is_pinned())
.collect::<Vec<_>>();
candidates.sort_by_cached_key(|p| {
let lru_score = (now() - p.last_access).as_secs();
let importance = 1.0 / (p.importance + 1e-6);
(lru_score as f32 * importance) as u64
});
candidates[0].id
}
```
---
### 3. Prefetch Predictor
**Responsibilities**:
- Predict next N accesses
- Issue async prefetch requests
- Update model via streaming learning
- Track accuracy metrics
**Interfaces**:
```rust
pub trait PrefetchPredictor {
fn predict(&self, context: &AccessContext) -> Vec<PageId>;
fn update(&mut self, actual: PageId);
fn accuracy(&self) -> f32;
}
```
**Hoeffding Tree Implementation**:
```rust
struct HoeffdingTreePredictor {
tree: HoeffdingTree,
feature_window: VecDeque<AccessFeatures>,
predictions: VecDeque<PageId>,
hits: usize,
total: usize,
}
impl PrefetchPredictor for HoeffdingTreePredictor {
fn predict(&self, context: &AccessContext) -> Vec<PageId> {
// Extract features
let features = self.extract_features(context);
// Predict next 5-10 pages
let mut predictions = Vec::new();
for _ in 0..10 {
let page_id = self.tree.predict(&features);
predictions.push(page_id);
}
predictions
}
fn update(&mut self, actual: PageId) {
// Streaming update
if let Some(predicted) = self.predictions.pop_front() {
let correct = predicted == actual;
if correct {
self.hits += 1;
}
self.total += 1;
// Update tree
self.tree.partial_fit(&self.feature_window[0], actual);
}
// Slide window
self.feature_window.push_back(AccessFeatures::from(actual));
if self.feature_window.len() > 10 {
self.feature_window.pop_front();
}
}
fn accuracy(&self) -> f32 {
self.hits as f32 / self.total as f32
}
}
```
**Feature Engineering**:
```rust
struct AccessFeatures {
current_page: PageId,
recent_history: [PageId; 10],
semantic_context: [f32; 128],
time_of_day: f32,
query_type: u8,
}
impl AccessFeatures {
fn extract(context: &AccessContext) -> Self {
Self {
current_page: context.current_page,
recent_history: context.history.last_n(10),
semantic_context: context.embedding,
time_of_day: context.timestamp.hour() as f32 / 24.0,
query_type: context.query_type as u8,
}
}
}
```
---
### 4. Neural Field Storage
**Responsibilities**:
- Memory-map petabyte-scale manifolds
- Hash-encode addresses (Instant-NGP style)
- Lazy allocation/evaluation
- Persist changes to disk
**Interfaces**:
```rust
pub trait NeuralFieldStorage {
fn read(&self, addr: u64, len: usize) -> Result<&[f32]>;
fn write(&mut self, addr: u64, data: &[f32]) -> Result<()>;
fn hash_address(&self, concept: &[f32]) -> u64;
fn flush(&mut self) -> Result<()>;
}
```
**Memory-Mapped Neural Field**:
```rust
pub struct MmapNeuralField {
// Memory-mapped file
mmap: MmapMut,
// Virtual address space size
virtual_size: usize,
// Physical backing file
backing_file: File,
// Multi-resolution hash tables
hash_tables: Vec<HashTable>,
// Access tracking
access_log: AccessLog,
}
impl MmapNeuralField {
pub fn new(path: impl AsRef<Path>, virtual_size: usize) -> Result<Self> {
// Create/open backing file
let file = OpenOptions::new()
.read(true)
.write(true)
.create(true)
.open(path)?;
// Set file size
file.set_len(virtual_size as u64)?;
// Memory-map
let mmap = unsafe { MmapMut::map_mut(&file)? };
Ok(Self {
mmap,
virtual_size,
backing_file: file,
hash_tables: Self::init_hash_tables(),
access_log: AccessLog::new(),
})
}
fn init_hash_tables() -> Vec<HashTable> {
// Multi-resolution à la Instant-NGP
vec![
HashTable::new(1 << 16), // 64K entries
HashTable::new(1 << 18), // 256K entries
HashTable::new(1 << 20), // 1M entries
HashTable::new(1 << 22), // 4M entries
HashTable::new(1 << 24), // 16M entries
]
}
}
impl NeuralFieldStorage for MmapNeuralField {
fn read(&self, addr: u64, len: usize) -> Result<&[f32]> {
// Bounds check
let start = addr as usize;
let end = start + len * std::mem::size_of::<f32>();
if end > self.virtual_size {
return Err(Error::OutOfBounds);
}
// Direct access to mmap'd memory
let slice = &self.mmap[start..end];
// Reinterpret as f32
let ptr = slice.as_ptr() as *const f32;
let data = unsafe { std::slice::from_raw_parts(ptr, len) };
// Log access
self.access_log.record(addr);
Ok(data)
}
fn write(&mut self, addr: u64, data: &[f32]) -> Result<()> {
let start = addr as usize;
let end = start + data.len() * std::mem::size_of::<f32>();
if end > self.virtual_size {
return Err(Error::OutOfBounds);
}
// Write to mmap'd memory
let slice = &mut self.mmap[start..end];
let ptr = slice.as_mut_ptr() as *mut f32;
let dest = unsafe { std::slice::from_raw_parts_mut(ptr, data.len()) };
dest.copy_from_slice(data);
Ok(())
}
fn hash_address(&self, concept: &[f32]) -> u64 {
// Multi-resolution hashing
let mut hash = 0u64;
for (i, table) in self.hash_tables.iter().enumerate() {
let resolution = 1 << i;
let quantized = quantize(concept, resolution);
hash ^= table.hash(&quantized);
}
hash % (self.virtual_size as u64 / std::mem::size_of::<f32>() as u64)
}
fn flush(&mut self) -> Result<()> {
// Async flush to disk
self.mmap.flush_async()?;
Ok(())
}
}
```
**Hash Encoding**:
```rust
fn quantize(concept: &[f32], resolution: usize) -> Vec<u8> {
concept.iter()
.map(|&x| ((x * resolution as f32).round() as i32).to_le_bytes())
.flatten()
.collect()
}
struct HashTable {
table: Vec<u64>,
}
impl HashTable {
fn new(size: usize) -> Self {
Self {
table: vec![0; size],
}
}
fn hash(&self, data: &[u8]) -> u64 {
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
let mut hasher = DefaultHasher::new();
data.hash(&mut hasher);
hasher.finish() % self.table.len() as u64
}
}
```
---
## Data Structures
### Page Descriptor
```rust
struct Page {
id: PageId,
tier: Tier,
data: PageData,
metadata: PageMetadata,
}
struct PageMetadata {
size: usize,
last_access: Instant,
access_count: usize,
importance: f32,
is_dirty: bool,
is_pinned: bool,
}
enum PageData {
Resident(Vec<f32>), // In DRAM
Mapped(MmapRef), // Memory-mapped
Evicted(DiskLocation), // On disk
}
enum Tier {
L1Dram,
L2Cxl,
L3Ssd,
L4Hdd,
}
```
### Access Log
```rust
struct AccessLog {
entries: RingBuffer<AccessEntry>,
indices: HashMap<PageId, Vec<usize>>,
}
struct AccessEntry {
page_id: PageId,
timestamp: Instant,
latency: Duration,
tier: Tier,
}
impl AccessLog {
fn record(&mut self, page_id: PageId, tier: Tier, latency: Duration) {
let entry = AccessEntry {
page_id,
timestamp: Instant::now(),
latency,
tier,
};
let index = self.entries.push(entry);
self.indices.entry(page_id)
.or_insert_with(Vec::new)
.push(index);
}
fn recent_accesses(&self, duration: Duration) -> impl Iterator<Item = &AccessEntry> {
let cutoff = Instant::now() - duration;
self.entries.iter()
.filter(move |e| e.timestamp > cutoff)
}
fn access_pattern(&self, page_id: PageId) -> AccessPattern {
let indices = self.indices.get(&page_id).unwrap_or(&vec![]);
let accesses: Vec<_> = indices.iter()
.map(|&i| &self.entries[i])
.collect();
AccessPattern::analyze(&accesses)
}
}
```
---
## Algorithms
### 1. Query Processing
```rust
impl InferenceEngine {
fn query(&mut self, input: &[f32]) -> Result<Vec<f32>> {
// 1. Hash input to concept address
let addr = self.storage.hash_address(input);
// 2. Check if in memory
let data = match self.memory_mgr.try_load(addr) {
Some(d) => d,
None => {
// 3. Page fault - load from storage
self.stats.record_miss();
self.memory_mgr.load_page(addr)?
}
};
// 4. Predict next accesses
let context = AccessContext::from_current(addr, input);
let predictions = self.prefetcher.predict(&context);
// 5. Async prefetch
for page_id in predictions {
self.prefetcher.queue_prefetch(page_id);
}
// 6. SIMD-accelerated inference
let output = self.compute_simd(data, input);
// 7. Update prefetcher
self.prefetcher.update(addr);
Ok(output)
}
fn compute_simd(&self, weights: &[f32], input: &[f32]) -> Vec<f32> {
use std::arch::x86_64::*;
let mut output = vec![0.0f32; weights.len() / input.len()];
unsafe {
for (i, chunk) in weights.chunks_exact(input.len()).enumerate() {
let mut sum = _mm256_setzero_ps();
for j in (0..input.len()).step_by(8) {
let w = _mm256_loadu_ps(&chunk[j]);
let x = _mm256_loadu_ps(&input[j]);
sum = _mm256_fmadd_ps(w, x, sum);
}
// Horizontal sum
let sum_arr: [f32; 8] = std::mem::transmute(sum);
output[i] = sum_arr.iter().sum();
}
}
output
}
}
```
### 2. Tier Migration
```rust
impl MemoryManager {
fn migrate_pages(&mut self) {
// Background task: migrate pages between tiers
// 1. Identify promotion candidates
let promote = self.access_log.recent_accesses(Duration::from_secs(60))
.filter(|e| e.tier != Tier::L1Dram)
.map(|e| e.page_id)
.collect::<HashSet<_>>();
for page_id in promote {
if let Some(prediction) = self.prefetcher.confidence(page_id) {
if prediction > 0.8 {
self.promote(page_id, Tier::L1Dram)?;
}
}
}
// 2. Identify demotion candidates
let demote = self.tiers[Tier::L1Dram]
.pages()
.filter(|p| {
let last_access = Instant::now() - p.last_access;
last_access > Duration::from_secs(300)
})
.map(|p| p.id)
.collect::<Vec<_>>();
for page_id in demote {
self.demote(page_id, Tier::L2Cxl)?;
}
}
fn promote(&mut self, page_id: PageId, target_tier: Tier) -> Result<()> {
// Load from current tier
let page = self.load_page(page_id)?;
// Write to target tier
self.tiers[target_tier].insert(page_id, page.data.clone())?;
// Remove from old tier (unless it's persistent storage)
if page.tier > target_tier {
self.tiers[page.tier].remove(page_id)?;
}
self.stats.record_promotion(page.tier, target_tier);
Ok(())
}
}
```
### 3. Prefetch Execution
```rust
impl PrefetchPredictor {
fn run_prefetch_loop(&mut self) {
loop {
// 1. Get next prediction
let page_id = self.prefetch_queue.pop();
// 2. Check if already in fast tier
if self.memory_mgr.is_in_tier(page_id, Tier::L1Dram) {
continue;
}
// 3. Async load
let handle = self.async_load(page_id);
// 4. When complete, promote to L1
self.pending_prefetches.push((page_id, handle));
}
}
fn async_load(&self, page_id: PageId) -> JoinHandle<Vec<f32>> {
let storage = self.storage.clone();
std::thread::spawn(move || {
storage.read_page(page_id).unwrap()
})
}
}
```
---
## Performance Model
### Latency Budget
**Target**: 1 ms end-to-end query latency
| Operation | Latency | Budget % |
|-----------|---------|----------|
| Hash address | 100 ns | 0.01% |
| L1 DRAM hit | 80 ns | 0.008% |
| L2 CXL hit | 350 ns | 0.035% |
| L3 SSD hit (prefetched) | 80 μs | 8% |
| L4 HDD hit (cold miss) | 10 ms | 1000% ❌ |
| SIMD inference | 500 μs | 50% |
| Prefetch prediction | 50 μs | 5% |
| Misc overhead | 200 μs | 20% |
**Total (95% L1 hit rate)**:
- 95% × 80 ns = 76 ns
- 4% × 350 ns = 14 ns
- 1% × 80 μs = 800 ns
- Inference: 500 μs
- **Total**: ~500 μs ✅
**Total (with 2.4% L3 miss)**:
- 97.6% × 80 ns = 78 ns
- 2% × 350 ns = 7 ns
- 0.4% × 80 μs = 320 ns
- Inference: 500 μs
- **Total**: ~500 μs ✅
### Throughput Model
**Single-threaded**:
- Queries per second: 1 / 500 μs = **2000 QPS**
**Multi-threaded (16 cores)**:
- Queries per second: 2000 × 16 = **32,000 QPS**
**Batched (batch size 100)**:
- Amortize overhead: 200 μs / 100 = 2 μs per query
- SIMD benefits: 500 μs → 50 μs per query (10× parallelism)
- **Total**: ~130 μs per query → **7,700 QPS per core****123,000 QPS (16 cores)**
### Capacity Model
| Tier | Capacity | Active Pages | Page Size | Total |
|------|----------|--------------|-----------|-------|
| L1 | 64 GB | 16K | 4 MB | 64 GB |
| L2 | 512 GB | 128K | 4 MB | 512 GB |
| L3 | 4 TB | 1M | 4 MB | 4 TB |
| L4 | 1 PB | 256M | 4 MB | 1 PB |
**Total Virtual Address Space**: 2^64 bytes = 16 EB
### Energy Model
**Power Consumption**:
| Component | Idle | Active | Average (50% util) |
|-----------|------|--------|--------------------|
| CPU (16 cores) | 50 W | 200 W | 125 W |
| DRAM (64 GB) | 20 W | 40 W | 30 W |
| CXL (512 GB) | 30 W | 60 W | 45 W |
| SSD (10×) | 50 W | 150 W | 100 W |
| HDD (20×) | 40 W | 100 W | 70 W |
| **Total** | **190 W** | **550 W** | **370 W** |
**vs. All-DRAM (1 PB)**:
- 1 PB DRAM: ~300 kW (infeasible)
- DPNC: ~370 W (800× reduction) ✅
---
## Implementation Plan
### Phase 1: Foundation (2 weeks)
**Week 1**: Core data structures
- [ ] `MmapNeuralField` implementation
- [ ] `Page` and `PageMetadata`
- [ ] `AccessLog` ring buffer
- [ ] Basic hash encoding
**Week 2**: Memory management
- [ ] `MemoryManager` with 2 tiers (DRAM, SSD)
- [ ] LRU eviction
- [ ] Sync page load
- [ ] Unit tests
**Deliverable**: Can mmap 10 GB neural field, load pages on demand
---
### Phase 2: Intelligence (2 weeks)
**Week 3**: Prefetch predictor
- [ ] Hoeffding Tree implementation
- [ ] Feature extraction
- [ ] Streaming updates
- [ ] Accuracy tracking
**Week 4**: Async prefetching
- [ ] Prefetch queue
- [ ] Async I/O with `tokio`
- [ ] Integration with memory manager
- [ ] Benchmarks
**Deliverable**: 95%+ prefetch accuracy on synthetic workload
---
### Phase 3: Optimization (2 weeks)
**Week 5**: SIMD acceleration
- [ ] AVX-512 kernels for matmul
- [ ] Zero-copy mmap access
- [ ] Benchmark vs. baseline
- [ ] Profiling and tuning
**Week 6**: Multi-tier
- [ ] Add L2 (CXL or simulated)
- [ ] Add L4 (HDD)
- [ ] Tier migration policies
- [ ] End-to-end benchmarks
**Deliverable**: 8× SIMD speedup, <500 μs query latency
---
### Phase 4: Scale (2 weeks)
**Week 7**: Petabyte scale
- [ ] Sparse hash addressing
- [ ] Multi-SSD parallelism (10× SSDs)
- [ ] Continuous learning for 1 week (24/7)
- [ ] Stability testing
**Week 8**: Production hardening
- [ ] Error handling
- [ ] Crash recovery
- [ ] Monitoring/metrics
- [ ] Documentation
**Deliverable**: 1 PB virtual space, robust production system
---
## Success Metrics
| Metric | Target | Measurement |
|--------|--------|-------------|
| Virtual Capacity | 1 PB | Virtual address space size |
| Physical Footprint | 64 GB DRAM + 4 TB SSD | Actual allocation |
| Query Latency (p50) | <500 μs | Histogram |
| Query Latency (p99) | <5 ms | Histogram |
| Prefetch Accuracy | >95% | Hits / Total |
| Throughput | >10K QPS | Queries per second |
| Energy | <400 W | Power meter |
| SIMD Speedup | >5× | vs. scalar baseline |
---
## Conclusion
This architecture synthesizes cutting-edge techniques from systems, ML, and hardware to achieve **petabyte-scale continuous cognition**. The design is **implementable today** with commodity hardware (NVMe SSDs, DRAM, CPUs with AVX-512).
**Key Innovations**:
1. Memory-mapped neural fields for zero-copy access
2. Multi-tier hierarchy mirroring human memory
3. Predictive prefetching with streaming ML
4. SIMD-accelerated inference on mmap'd data
**Expected Outcome**: A working system demonstrating <1 ms retrieval from 1 PB knowledge manifold.
---
*Architecture designed: 2025-12-04*
*Target: Production deployment 2026-Q2*