wifi-densepose/examples/exo-ai-2025/research/05-memory-mapped-neural-fields/architecture.md

# System Architecture: Demand-Paged Neural Cognition

## Table of Contents
1. [Overview](#overview)
2. [Component Architecture](#component-architecture)
3. [Data Structures](#data-structures)
4. [Algorithms](#algorithms)
5. [Performance Model](#performance-model)
6. [Implementation Plan](#implementation-plan)

---

## Overview

### System Diagram

```
┌───────────────────────────────────────────────────────────────────┐
│                         DPNC Agent                                │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │  Inference Engine (hot path)                                │ │
│  │  - Query processing                                         │ │
│  │  - SIMD-accelerated inference                              │ │
│  │  - Context assembly                                         │ │
│  └────────────┬────────────────────────────────────────────────┘ │
│               │                                                   │
│  ┌────────────▼────────────────────────────────────────────────┐ │
│  │  Memory Manager                                             │ │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │ │
│  │  │ L1 DRAM  │  │ L2 CXL   │  │ L3 SSD   │  │ L4 HDD   │   │ │
│  │  │  64 GB   │◄─┤ 512 GB   │◄─┤  4 TB    │◄─┤  1 PB    │   │ │
│  │  │ 80ns     │  │ 350ns    │  │ 80μs     │  │ 10ms     │   │ │
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘   │ │
│  │        ▲             ▲             ▲             ▲          │ │
│  │        └─────────────┴─────────────┴─────────────┘          │ │
│  │                  Tier Migration Policy                       │ │
│  └────────────┬────────────────────────────────────────────────┘ │
│               │                                                   │
│  ┌────────────▼────────────────────────────────────────────────┐ │
│  │  Prefetch Predictor (Hoeffding Tree)                        │ │
│  │  - Streaming ML model (0.3 MB)                             │ │
│  │  - 97.6% accuracy                                           │ │
│  │  - Async prefetch queue                                     │ │
│  └────────────┬────────────────────────────────────────────────┘ │
│               │                                                   │
│  ┌────────────▼────────────────────────────────────────────────┐ │
│  │  Neural Field Storage                                       │ │
│  │  - Memory-mapped files (mmap)                              │ │
│  │  - Multi-resolution hash encoding                          │ │
│  │  - Sparse distributed addressing                           │ │
│  │  - Lazy evaluation                                          │ │
│  └─────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
                              │
                              │ I/O
                              ▼
              ┌─────────────────────────────┐
              │  Persistent Storage         │
              │  - NVMe SSD array (10×)     │
              │  - HDD archive              │
              │  - Object storage (S3)      │
              └─────────────────────────────┘
```

---

## Component Architecture

### 1. Inference Engine

**Responsibilities**:
- Process queries from user/application
- Assemble context from multi-tier memory
- Execute neural network inference
- Return results

**Interfaces**:
```rust
pub trait InferenceEngine {
    fn query(&mut self, input: &[f32]) -> Result<Vec<f32>>;
    fn context_size(&self) -> usize;
    fn active_memory(&self) -> usize;
}
```

**Implementation Strategy**:
- **Hot Path Optimization**: Keep inference loop in L1 cache
- **SIMD Kernels**: AVX-512 for matmul, dot products
- **Zero-Copy**: Work directly on mmap'd data
- **Async I/O**: Non-blocking prefetch requests

---

### 2. Memory Manager

**Responsibilities**:
- Manage 4-tier hierarchy (DRAM, CXL, SSD, HDD)
- Page in/out based on access patterns
- Handle page faults (cold misses)
- Coordinate with prefetcher

**Interfaces**:
```rust
pub trait MemoryManager {
    fn load_page(&mut self, addr: u64) -> Result<&[f32]>;
    fn evict_page(&mut self, addr: u64) -> Result<()>;
    fn promote(&mut self, addr: u64, target_tier: Tier) -> Result<()>;
    fn demote(&mut self, addr: u64, target_tier: Tier) -> Result<()>;
}
```

**Tier Migration Policy**:

```rust
enum MigrationPolicy {
    // Promote to faster tier
    Promote {
        trigger: PromoteTrigger,
        target: Tier,
    },

    // Demote to slower tier
    Demote {
        trigger: DemoteTrigger,
        target: Tier,
    },
}

enum PromoteTrigger {
    PredictedAccess(f32),      // Prefetcher confidence
    RecentAccess(Duration),    // Accessed within duration
    HighImportance(f32),       // Semantic importance score
}

enum DemoteTrigger {
    LRU(Duration),             // Not accessed in duration
    CapacityPressure(f32),     // Tier usage > threshold
    LowImportance(f32),        // Semantic importance < threshold
}
```

**Page Replacement Algorithm**:
```rust
fn evict_candidate(tier: Tier) -> PageId {
    // Weighted LRU + semantic importance
    let mut candidates = tier.pages()
        .filter(|p| !p.is_pinned())
        .collect::<Vec<_>>();

    candidates.sort_by_cached_key(|p| {
        let lru_score = (now() - p.last_access).as_secs();
        let importance = 1.0 / (p.importance + 1e-6);
        (lru_score as f32 * importance) as u64
    });

    candidates[0].id
}
```

---

### 3. Prefetch Predictor

**Responsibilities**:
- Predict next N accesses
- Issue async prefetch requests
- Update model via streaming learning
- Track accuracy metrics

**Interfaces**:
```rust
pub trait PrefetchPredictor {
    fn predict(&self, context: &AccessContext) -> Vec<PageId>;
    fn update(&mut self, actual: PageId);
    fn accuracy(&self) -> f32;
}
```

**Hoeffding Tree Implementation**:

```rust
struct HoeffdingTreePredictor {
    tree: HoeffdingTree,
    feature_window: VecDeque<AccessFeatures>,
    predictions: VecDeque<PageId>,
    hits: usize,
    total: usize,
}

impl PrefetchPredictor for HoeffdingTreePredictor {
    fn predict(&self, context: &AccessContext) -> Vec<PageId> {
        // Extract features
        let features = self.extract_features(context);

        // Predict next 5-10 pages
        let mut predictions = Vec::new();
        for _ in 0..10 {
            let page_id = self.tree.predict(&features);
            predictions.push(page_id);
        }

        predictions
    }

    fn update(&mut self, actual: PageId) {
        // Streaming update
        if let Some(predicted) = self.predictions.pop_front() {
            let correct = predicted == actual;
            if correct {
                self.hits += 1;
            }
            self.total += 1;

            // Update tree
            self.tree.partial_fit(&self.feature_window[0], actual);
        }

        // Slide window
        self.feature_window.push_back(AccessFeatures::from(actual));
        if self.feature_window.len() > 10 {
            self.feature_window.pop_front();
        }
    }

    fn accuracy(&self) -> f32 {
        self.hits as f32 / self.total as f32
    }
}
```

**Feature Engineering**:
```rust
struct AccessFeatures {
    current_page: PageId,
    recent_history: [PageId; 10],
    semantic_context: [f32; 128],
    time_of_day: f32,
    query_type: u8,
}

impl AccessFeatures {
    fn extract(context: &AccessContext) -> Self {
        Self {
            current_page: context.current_page,
            recent_history: context.history.last_n(10),
            semantic_context: context.embedding,
            time_of_day: context.timestamp.hour() as f32 / 24.0,
            query_type: context.query_type as u8,
        }
    }
}
```

---

### 4. Neural Field Storage

**Responsibilities**:
- Memory-map petabyte-scale manifolds
- Hash-encode addresses (Instant-NGP style)
- Lazy allocation/evaluation
- Persist changes to disk

**Interfaces**:
```rust
pub trait NeuralFieldStorage {
    fn read(&self, addr: u64, len: usize) -> Result<&[f32]>;
    fn write(&mut self, addr: u64, data: &[f32]) -> Result<()>;
    fn hash_address(&self, concept: &[f32]) -> u64;
    fn flush(&mut self) -> Result<()>;
}
```

**Memory-Mapped Neural Field**:

```rust
pub struct MmapNeuralField {
    // Memory-mapped file
    mmap: MmapMut,

    // Virtual address space size
    virtual_size: usize,

    // Physical backing file
    backing_file: File,

    // Multi-resolution hash tables
    hash_tables: Vec<HashTable>,

    // Access tracking
    access_log: AccessLog,
}

impl MmapNeuralField {
    pub fn new(path: impl AsRef<Path>, virtual_size: usize) -> Result<Self> {
        // Create/open backing file
        let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open(path)?;

        // Set file size
        file.set_len(virtual_size as u64)?;

        // Memory-map
        let mmap = unsafe { MmapMut::map_mut(&file)? };

        Ok(Self {
            mmap,
            virtual_size,
            backing_file: file,
            hash_tables: Self::init_hash_tables(),
            access_log: AccessLog::new(),
        })
    }

    fn init_hash_tables() -> Vec<HashTable> {
        // Multi-resolution à la Instant-NGP
        vec![
            HashTable::new(1 << 16),   // 64K entries
            HashTable::new(1 << 18),   // 256K entries
            HashTable::new(1 << 20),   // 1M entries
            HashTable::new(1 << 22),   // 4M entries
            HashTable::new(1 << 24),   // 16M entries
        ]
    }
}

impl NeuralFieldStorage for MmapNeuralField {
    fn read(&self, addr: u64, len: usize) -> Result<&[f32]> {
        // Bounds check
        let start = addr as usize;
        let end = start + len * std::mem::size_of::<f32>();
        if end > self.virtual_size {
            return Err(Error::OutOfBounds);
        }

        // Direct access to mmap'd memory
        let slice = &self.mmap[start..end];

        // Reinterpret as f32
        let ptr = slice.as_ptr() as *const f32;
        let data = unsafe { std::slice::from_raw_parts(ptr, len) };

        // Log access
        self.access_log.record(addr);

        Ok(data)
    }

    fn write(&mut self, addr: u64, data: &[f32]) -> Result<()> {
        let start = addr as usize;
        let end = start + data.len() * std::mem::size_of::<f32>();
        if end > self.virtual_size {
            return Err(Error::OutOfBounds);
        }

        // Write to mmap'd memory
        let slice = &mut self.mmap[start..end];
        let ptr = slice.as_mut_ptr() as *mut f32;
        let dest = unsafe { std::slice::from_raw_parts_mut(ptr, data.len()) };
        dest.copy_from_slice(data);

        Ok(())
    }

    fn hash_address(&self, concept: &[f32]) -> u64 {
        // Multi-resolution hashing
        let mut hash = 0u64;
        for (i, table) in self.hash_tables.iter().enumerate() {
            let resolution = 1 << i;
            let quantized = quantize(concept, resolution);
            hash ^= table.hash(&quantized);
        }
        hash % (self.virtual_size as u64 / std::mem::size_of::<f32>() as u64)
    }

    fn flush(&mut self) -> Result<()> {
        // Async flush to disk
        self.mmap.flush_async()?;
        Ok(())
    }
}
```

**Hash Encoding**:

```rust
fn quantize(concept: &[f32], resolution: usize) -> Vec<u8> {
    concept.iter()
        .map(|&x| ((x * resolution as f32).round() as i32).to_le_bytes())
        .flatten()
        .collect()
}

struct HashTable {
    table: Vec<u64>,
}

impl HashTable {
    fn new(size: usize) -> Self {
        Self {
            table: vec![0; size],
        }
    }

    fn hash(&self, data: &[u8]) -> u64 {
        use std::collections::hash_map::DefaultHasher;
        use std::hash::{Hash, Hasher};

        let mut hasher = DefaultHasher::new();
        data.hash(&mut hasher);
        hasher.finish() % self.table.len() as u64
    }
}
```

---

## Data Structures

### Page Descriptor

```rust
struct Page {
    id: PageId,
    tier: Tier,
    data: PageData,
    metadata: PageMetadata,
}

struct PageMetadata {
    size: usize,
    last_access: Instant,
    access_count: usize,
    importance: f32,
    is_dirty: bool,
    is_pinned: bool,
}

enum PageData {
    Resident(Vec<f32>),         // In DRAM
    Mapped(MmapRef),            // Memory-mapped
    Evicted(DiskLocation),      // On disk
}

enum Tier {
    L1Dram,
    L2Cxl,
    L3Ssd,
    L4Hdd,
}
```

### Access Log

```rust
struct AccessLog {
    entries: RingBuffer<AccessEntry>,
    indices: HashMap<PageId, Vec<usize>>,
}

struct AccessEntry {
    page_id: PageId,
    timestamp: Instant,
    latency: Duration,
    tier: Tier,
}

impl AccessLog {
    fn record(&mut self, page_id: PageId, tier: Tier, latency: Duration) {
        let entry = AccessEntry {
            page_id,
            timestamp: Instant::now(),
            latency,
            tier,
        };

        let index = self.entries.push(entry);
        self.indices.entry(page_id)
            .or_insert_with(Vec::new)
            .push(index);
    }

    fn recent_accesses(&self, duration: Duration) -> impl Iterator<Item = &AccessEntry> {
        let cutoff = Instant::now() - duration;
        self.entries.iter()
            .filter(move |e| e.timestamp > cutoff)
    }

    fn access_pattern(&self, page_id: PageId) -> AccessPattern {
        let indices = self.indices.get(&page_id).unwrap_or(&vec![]);
        let accesses: Vec<_> = indices.iter()
            .map(|&i| &self.entries[i])
            .collect();

        AccessPattern::analyze(&accesses)
    }
}
```

---

## Algorithms

### 1. Query Processing

```rust
impl InferenceEngine {
    fn query(&mut self, input: &[f32]) -> Result<Vec<f32>> {
        // 1. Hash input to concept address
        let addr = self.storage.hash_address(input);

        // 2. Check if in memory
        let data = match self.memory_mgr.try_load(addr) {
            Some(d) => d,
            None => {
                // 3. Page fault - load from storage
                self.stats.record_miss();
                self.memory_mgr.load_page(addr)?
            }
        };

        // 4. Predict next accesses
        let context = AccessContext::from_current(addr, input);
        let predictions = self.prefetcher.predict(&context);

        // 5. Async prefetch
        for page_id in predictions {
            self.prefetcher.queue_prefetch(page_id);
        }

        // 6. SIMD-accelerated inference
        let output = self.compute_simd(data, input);

        // 7. Update prefetcher
        self.prefetcher.update(addr);

        Ok(output)
    }

    fn compute_simd(&self, weights: &[f32], input: &[f32]) -> Vec<f32> {
        use std::arch::x86_64::*;

        let mut output = vec![0.0f32; weights.len() / input.len()];

        unsafe {
            for (i, chunk) in weights.chunks_exact(input.len()).enumerate() {
                let mut sum = _mm256_setzero_ps();

                for j in (0..input.len()).step_by(8) {
                    let w = _mm256_loadu_ps(&chunk[j]);
                    let x = _mm256_loadu_ps(&input[j]);
                    sum = _mm256_fmadd_ps(w, x, sum);
                }

                // Horizontal sum
                let sum_arr: [f32; 8] = std::mem::transmute(sum);
                output[i] = sum_arr.iter().sum();
            }
        }

        output
    }
}
```

### 2. Tier Migration

```rust
impl MemoryManager {
    fn migrate_pages(&mut self) {
        // Background task: migrate pages between tiers

        // 1. Identify promotion candidates
        let promote = self.access_log.recent_accesses(Duration::from_secs(60))
            .filter(|e| e.tier != Tier::L1Dram)
            .map(|e| e.page_id)
            .collect::<HashSet<_>>();

        for page_id in promote {
            if let Some(prediction) = self.prefetcher.confidence(page_id) {
                if prediction > 0.8 {
                    self.promote(page_id, Tier::L1Dram)?;
                }
            }
        }

        // 2. Identify demotion candidates
        let demote = self.tiers[Tier::L1Dram]
            .pages()
            .filter(|p| {
                let last_access = Instant::now() - p.last_access;
                last_access > Duration::from_secs(300)
            })
            .map(|p| p.id)
            .collect::<Vec<_>>();

        for page_id in demote {
            self.demote(page_id, Tier::L2Cxl)?;
        }
    }

    fn promote(&mut self, page_id: PageId, target_tier: Tier) -> Result<()> {
        // Load from current tier
        let page = self.load_page(page_id)?;

        // Write to target tier
        self.tiers[target_tier].insert(page_id, page.data.clone())?;

        // Remove from old tier (unless it's persistent storage)
        if page.tier > target_tier {
            self.tiers[page.tier].remove(page_id)?;
        }

        self.stats.record_promotion(page.tier, target_tier);
        Ok(())
    }
}
```

### 3. Prefetch Execution

```rust
impl PrefetchPredictor {
    fn run_prefetch_loop(&mut self) {
        loop {
            // 1. Get next prediction
            let page_id = self.prefetch_queue.pop();

            // 2. Check if already in fast tier
            if self.memory_mgr.is_in_tier(page_id, Tier::L1Dram) {
                continue;
            }

            // 3. Async load
            let handle = self.async_load(page_id);

            // 4. When complete, promote to L1
            self.pending_prefetches.push((page_id, handle));
        }
    }

    fn async_load(&self, page_id: PageId) -> JoinHandle<Vec<f32>> {
        let storage = self.storage.clone();
        std::thread::spawn(move || {
            storage.read_page(page_id).unwrap()
        })
    }
}
```

---

## Performance Model

### Latency Budget

**Target**: 1 ms end-to-end query latency

| Operation | Latency | Budget % |
|-----------|---------|----------|
| Hash address | 100 ns | 0.01% |
| L1 DRAM hit | 80 ns | 0.008% |
| L2 CXL hit | 350 ns | 0.035% |
| L3 SSD hit (prefetched) | 80 μs | 8% |
| L4 HDD hit (cold miss) | 10 ms | 1000% ❌ |
| SIMD inference | 500 μs | 50% |
| Prefetch prediction | 50 μs | 5% |
| Misc overhead | 200 μs | 20% |

**Total (95% L1 hit rate)**:
- 95% × 80 ns = 76 ns
- 4% × 350 ns = 14 ns
- 1% × 80 μs = 800 ns
- Inference: 500 μs
- **Total**: ~500 μs ✅

**Total (with 2.4% L3 miss)**:
- 97.6% × 80 ns = 78 ns
- 2% × 350 ns = 7 ns
- 0.4% × 80 μs = 320 ns
- Inference: 500 μs
- **Total**: ~500 μs ✅

### Throughput Model

**Single-threaded**:
- Queries per second: 1 / 500 μs = **2000 QPS**

**Multi-threaded (16 cores)**:
- Queries per second: 2000 × 16 = **32,000 QPS**

**Batched (batch size 100)**:
- Amortize overhead: 200 μs / 100 = 2 μs per query
- SIMD benefits: 500 μs → 50 μs per query (10× parallelism)
- **Total**: ~130 μs per query → **7,700 QPS per core** → **123,000 QPS (16 cores)**

### Capacity Model

| Tier | Capacity | Active Pages | Page Size | Total |
|------|----------|--------------|-----------|-------|
| L1 | 64 GB | 16K | 4 MB | 64 GB |
| L2 | 512 GB | 128K | 4 MB | 512 GB |
| L3 | 4 TB | 1M | 4 MB | 4 TB |
| L4 | 1 PB | 256M | 4 MB | 1 PB |

**Total Virtual Address Space**: 2^64 bytes = 16 EB

### Energy Model

**Power Consumption**:

| Component | Idle | Active | Average (50% util) |
|-----------|------|--------|--------------------|
| CPU (16 cores) | 50 W | 200 W | 125 W |
| DRAM (64 GB) | 20 W | 40 W | 30 W |
| CXL (512 GB) | 30 W | 60 W | 45 W |
| SSD (10×) | 50 W | 150 W | 100 W |
| HDD (20×) | 40 W | 100 W | 70 W |
| **Total** | **190 W** | **550 W** | **370 W** |

**vs. All-DRAM (1 PB)**:
- 1 PB DRAM: ~300 kW (infeasible)
- DPNC: ~370 W (800× reduction) ✅

---

## Implementation Plan

### Phase 1: Foundation (2 weeks)

**Week 1**: Core data structures
- [ ] `MmapNeuralField` implementation
- [ ] `Page` and `PageMetadata`
- [ ] `AccessLog` ring buffer
- [ ] Basic hash encoding

**Week 2**: Memory management
- [ ] `MemoryManager` with 2 tiers (DRAM, SSD)
- [ ] LRU eviction
- [ ] Sync page load
- [ ] Unit tests

**Deliverable**: Can mmap 10 GB neural field, load pages on demand

---

### Phase 2: Intelligence (2 weeks)

**Week 3**: Prefetch predictor
- [ ] Hoeffding Tree implementation
- [ ] Feature extraction
- [ ] Streaming updates
- [ ] Accuracy tracking

**Week 4**: Async prefetching
- [ ] Prefetch queue
- [ ] Async I/O with `tokio`
- [ ] Integration with memory manager
- [ ] Benchmarks

**Deliverable**: 95%+ prefetch accuracy on synthetic workload

---

### Phase 3: Optimization (2 weeks)

**Week 5**: SIMD acceleration
- [ ] AVX-512 kernels for matmul
- [ ] Zero-copy mmap access
- [ ] Benchmark vs. baseline
- [ ] Profiling and tuning

**Week 6**: Multi-tier
- [ ] Add L2 (CXL or simulated)
- [ ] Add L4 (HDD)
- [ ] Tier migration policies
- [ ] End-to-end benchmarks

**Deliverable**: 8× SIMD speedup, <500 μs query latency

---

### Phase 4: Scale (2 weeks)

**Week 7**: Petabyte scale
- [ ] Sparse hash addressing
- [ ] Multi-SSD parallelism (10× SSDs)
- [ ] Continuous learning for 1 week (24/7)
- [ ] Stability testing

**Week 8**: Production hardening
- [ ] Error handling
- [ ] Crash recovery
- [ ] Monitoring/metrics
- [ ] Documentation

**Deliverable**: 1 PB virtual space, robust production system

---

## Success Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| Virtual Capacity | 1 PB | Virtual address space size |
| Physical Footprint | 64 GB DRAM + 4 TB SSD | Actual allocation |
| Query Latency (p50) | <500 μs | Histogram |
| Query Latency (p99) | <5 ms | Histogram |
| Prefetch Accuracy | >95% | Hits / Total |
| Throughput | >10K QPS | Queries per second |
| Energy | <400 W | Power meter |
| SIMD Speedup | >5× | vs. scalar baseline |

---

## Conclusion

This architecture synthesizes cutting-edge techniques from systems, ML, and hardware to achieve **petabyte-scale continuous cognition**. The design is **implementable today** with commodity hardware (NVMe SSDs, DRAM, CPUs with AVX-512).

**Key Innovations**:
1. Memory-mapped neural fields for zero-copy access
2. Multi-tier hierarchy mirroring human memory
3. Predictive prefetching with streaming ML
4. SIMD-accelerated inference on mmap'd data

**Expected Outcome**: A working system demonstrating <1 ms retrieval from 1 PB knowledge manifold.

---

*Architecture designed: 2025-12-04*
*Target: Production deployment 2026-Q2*