Files
wifi-densepose/examples/exo-ai-2025/research/05-memory-mapped-neural-fields/architecture.md
ruv d803bfe2b1 Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00

24 KiB
Raw Blame History

System Architecture: Demand-Paged Neural Cognition

Table of Contents

  1. Overview
  2. Component Architecture
  3. Data Structures
  4. Algorithms
  5. Performance Model
  6. Implementation Plan

Overview

System Diagram

┌───────────────────────────────────────────────────────────────────┐
│                         DPNC Agent                                │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │  Inference Engine (hot path)                                │ │
│  │  - Query processing                                         │ │
│  │  - SIMD-accelerated inference                              │ │
│  │  - Context assembly                                         │ │
│  └────────────┬────────────────────────────────────────────────┘ │
│               │                                                   │
│  ┌────────────▼────────────────────────────────────────────────┐ │
│  │  Memory Manager                                             │ │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │ │
│  │  │ L1 DRAM  │  │ L2 CXL   │  │ L3 SSD   │  │ L4 HDD   │   │ │
│  │  │  64 GB   │◄─┤ 512 GB   │◄─┤  4 TB    │◄─┤  1 PB    │   │ │
│  │  │ 80ns     │  │ 350ns    │  │ 80μs     │  │ 10ms     │   │ │
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘   │ │
│  │        ▲             ▲             ▲             ▲          │ │
│  │        └─────────────┴─────────────┴─────────────┘          │ │
│  │                  Tier Migration Policy                       │ │
│  └────────────┬────────────────────────────────────────────────┘ │
│               │                                                   │
│  ┌────────────▼────────────────────────────────────────────────┐ │
│  │  Prefetch Predictor (Hoeffding Tree)                        │ │
│  │  - Streaming ML model (0.3 MB)                             │ │
│  │  - 97.6% accuracy                                           │ │
│  │  - Async prefetch queue                                     │ │
│  └────────────┬────────────────────────────────────────────────┘ │
│               │                                                   │
│  ┌────────────▼────────────────────────────────────────────────┐ │
│  │  Neural Field Storage                                       │ │
│  │  - Memory-mapped files (mmap)                              │ │
│  │  - Multi-resolution hash encoding                          │ │
│  │  - Sparse distributed addressing                           │ │
│  │  - Lazy evaluation                                          │ │
│  └─────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
                              │
                              │ I/O
                              ▼
              ┌─────────────────────────────┐
              │  Persistent Storage         │
              │  - NVMe SSD array (10×)     │
              │  - HDD archive              │
              │  - Object storage (S3)      │
              └─────────────────────────────┘

Component Architecture

1. Inference Engine

Responsibilities:

  • Process queries from user/application
  • Assemble context from multi-tier memory
  • Execute neural network inference
  • Return results

Interfaces:

pub trait InferenceEngine {
    fn query(&mut self, input: &[f32]) -> Result<Vec<f32>>;
    fn context_size(&self) -> usize;
    fn active_memory(&self) -> usize;
}

Implementation Strategy:

  • Hot Path Optimization: Keep inference loop in L1 cache
  • SIMD Kernels: AVX-512 for matmul, dot products
  • Zero-Copy: Work directly on mmap'd data
  • Async I/O: Non-blocking prefetch requests

2. Memory Manager

Responsibilities:

  • Manage 4-tier hierarchy (DRAM, CXL, SSD, HDD)
  • Page in/out based on access patterns
  • Handle page faults (cold misses)
  • Coordinate with prefetcher

Interfaces:

pub trait MemoryManager {
    fn load_page(&mut self, addr: u64) -> Result<&[f32]>;
    fn evict_page(&mut self, addr: u64) -> Result<()>;
    fn promote(&mut self, addr: u64, target_tier: Tier) -> Result<()>;
    fn demote(&mut self, addr: u64, target_tier: Tier) -> Result<()>;
}

Tier Migration Policy:

enum MigrationPolicy {
    // Promote to faster tier
    Promote {
        trigger: PromoteTrigger,
        target: Tier,
    },

    // Demote to slower tier
    Demote {
        trigger: DemoteTrigger,
        target: Tier,
    },
}

enum PromoteTrigger {
    PredictedAccess(f32),      // Prefetcher confidence
    RecentAccess(Duration),    // Accessed within duration
    HighImportance(f32),       // Semantic importance score
}

enum DemoteTrigger {
    LRU(Duration),             // Not accessed in duration
    CapacityPressure(f32),     // Tier usage > threshold
    LowImportance(f32),        // Semantic importance < threshold
}

Page Replacement Algorithm:

fn evict_candidate(tier: Tier) -> PageId {
    // Weighted LRU + semantic importance
    let mut candidates = tier.pages()
        .filter(|p| !p.is_pinned())
        .collect::<Vec<_>>();

    candidates.sort_by_cached_key(|p| {
        let lru_score = (now() - p.last_access).as_secs();
        let importance = 1.0 / (p.importance + 1e-6);
        (lru_score as f32 * importance) as u64
    });

    candidates[0].id
}

3. Prefetch Predictor

Responsibilities:

  • Predict next N accesses
  • Issue async prefetch requests
  • Update model via streaming learning
  • Track accuracy metrics

Interfaces:

pub trait PrefetchPredictor {
    fn predict(&self, context: &AccessContext) -> Vec<PageId>;
    fn update(&mut self, actual: PageId);
    fn accuracy(&self) -> f32;
}

Hoeffding Tree Implementation:

struct HoeffdingTreePredictor {
    tree: HoeffdingTree,
    feature_window: VecDeque<AccessFeatures>,
    predictions: VecDeque<PageId>,
    hits: usize,
    total: usize,
}

impl PrefetchPredictor for HoeffdingTreePredictor {
    fn predict(&self, context: &AccessContext) -> Vec<PageId> {
        // Extract features
        let features = self.extract_features(context);

        // Predict next 5-10 pages
        let mut predictions = Vec::new();
        for _ in 0..10 {
            let page_id = self.tree.predict(&features);
            predictions.push(page_id);
        }

        predictions
    }

    fn update(&mut self, actual: PageId) {
        // Streaming update
        if let Some(predicted) = self.predictions.pop_front() {
            let correct = predicted == actual;
            if correct {
                self.hits += 1;
            }
            self.total += 1;

            // Update tree
            self.tree.partial_fit(&self.feature_window[0], actual);
        }

        // Slide window
        self.feature_window.push_back(AccessFeatures::from(actual));
        if self.feature_window.len() > 10 {
            self.feature_window.pop_front();
        }
    }

    fn accuracy(&self) -> f32 {
        self.hits as f32 / self.total as f32
    }
}

Feature Engineering:

struct AccessFeatures {
    current_page: PageId,
    recent_history: [PageId; 10],
    semantic_context: [f32; 128],
    time_of_day: f32,
    query_type: u8,
}

impl AccessFeatures {
    fn extract(context: &AccessContext) -> Self {
        Self {
            current_page: context.current_page,
            recent_history: context.history.last_n(10),
            semantic_context: context.embedding,
            time_of_day: context.timestamp.hour() as f32 / 24.0,
            query_type: context.query_type as u8,
        }
    }
}

4. Neural Field Storage

Responsibilities:

  • Memory-map petabyte-scale manifolds
  • Hash-encode addresses (Instant-NGP style)
  • Lazy allocation/evaluation
  • Persist changes to disk

Interfaces:

pub trait NeuralFieldStorage {
    fn read(&self, addr: u64, len: usize) -> Result<&[f32]>;
    fn write(&mut self, addr: u64, data: &[f32]) -> Result<()>;
    fn hash_address(&self, concept: &[f32]) -> u64;
    fn flush(&mut self) -> Result<()>;
}

Memory-Mapped Neural Field:

pub struct MmapNeuralField {
    // Memory-mapped file
    mmap: MmapMut,

    // Virtual address space size
    virtual_size: usize,

    // Physical backing file
    backing_file: File,

    // Multi-resolution hash tables
    hash_tables: Vec<HashTable>,

    // Access tracking
    access_log: AccessLog,
}

impl MmapNeuralField {
    pub fn new(path: impl AsRef<Path>, virtual_size: usize) -> Result<Self> {
        // Create/open backing file
        let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open(path)?;

        // Set file size
        file.set_len(virtual_size as u64)?;

        // Memory-map
        let mmap = unsafe { MmapMut::map_mut(&file)? };

        Ok(Self {
            mmap,
            virtual_size,
            backing_file: file,
            hash_tables: Self::init_hash_tables(),
            access_log: AccessLog::new(),
        })
    }

    fn init_hash_tables() -> Vec<HashTable> {
        // Multi-resolution à la Instant-NGP
        vec![
            HashTable::new(1 << 16),   // 64K entries
            HashTable::new(1 << 18),   // 256K entries
            HashTable::new(1 << 20),   // 1M entries
            HashTable::new(1 << 22),   // 4M entries
            HashTable::new(1 << 24),   // 16M entries
        ]
    }
}

impl NeuralFieldStorage for MmapNeuralField {
    fn read(&self, addr: u64, len: usize) -> Result<&[f32]> {
        // Bounds check
        let start = addr as usize;
        let end = start + len * std::mem::size_of::<f32>();
        if end > self.virtual_size {
            return Err(Error::OutOfBounds);
        }

        // Direct access to mmap'd memory
        let slice = &self.mmap[start..end];

        // Reinterpret as f32
        let ptr = slice.as_ptr() as *const f32;
        let data = unsafe { std::slice::from_raw_parts(ptr, len) };

        // Log access
        self.access_log.record(addr);

        Ok(data)
    }

    fn write(&mut self, addr: u64, data: &[f32]) -> Result<()> {
        let start = addr as usize;
        let end = start + data.len() * std::mem::size_of::<f32>();
        if end > self.virtual_size {
            return Err(Error::OutOfBounds);
        }

        // Write to mmap'd memory
        let slice = &mut self.mmap[start..end];
        let ptr = slice.as_mut_ptr() as *mut f32;
        let dest = unsafe { std::slice::from_raw_parts_mut(ptr, data.len()) };
        dest.copy_from_slice(data);

        Ok(())
    }

    fn hash_address(&self, concept: &[f32]) -> u64 {
        // Multi-resolution hashing
        let mut hash = 0u64;
        for (i, table) in self.hash_tables.iter().enumerate() {
            let resolution = 1 << i;
            let quantized = quantize(concept, resolution);
            hash ^= table.hash(&quantized);
        }
        hash % (self.virtual_size as u64 / std::mem::size_of::<f32>() as u64)
    }

    fn flush(&mut self) -> Result<()> {
        // Async flush to disk
        self.mmap.flush_async()?;
        Ok(())
    }
}

Hash Encoding:

fn quantize(concept: &[f32], resolution: usize) -> Vec<u8> {
    concept.iter()
        .map(|&x| ((x * resolution as f32).round() as i32).to_le_bytes())
        .flatten()
        .collect()
}

struct HashTable {
    table: Vec<u64>,
}

impl HashTable {
    fn new(size: usize) -> Self {
        Self {
            table: vec![0; size],
        }
    }

    fn hash(&self, data: &[u8]) -> u64 {
        use std::collections::hash_map::DefaultHasher;
        use std::hash::{Hash, Hasher};

        let mut hasher = DefaultHasher::new();
        data.hash(&mut hasher);
        hasher.finish() % self.table.len() as u64
    }
}

Data Structures

Page Descriptor

struct Page {
    id: PageId,
    tier: Tier,
    data: PageData,
    metadata: PageMetadata,
}

struct PageMetadata {
    size: usize,
    last_access: Instant,
    access_count: usize,
    importance: f32,
    is_dirty: bool,
    is_pinned: bool,
}

enum PageData {
    Resident(Vec<f32>),         // In DRAM
    Mapped(MmapRef),            // Memory-mapped
    Evicted(DiskLocation),      // On disk
}

enum Tier {
    L1Dram,
    L2Cxl,
    L3Ssd,
    L4Hdd,
}

Access Log

struct AccessLog {
    entries: RingBuffer<AccessEntry>,
    indices: HashMap<PageId, Vec<usize>>,
}

struct AccessEntry {
    page_id: PageId,
    timestamp: Instant,
    latency: Duration,
    tier: Tier,
}

impl AccessLog {
    fn record(&mut self, page_id: PageId, tier: Tier, latency: Duration) {
        let entry = AccessEntry {
            page_id,
            timestamp: Instant::now(),
            latency,
            tier,
        };

        let index = self.entries.push(entry);
        self.indices.entry(page_id)
            .or_insert_with(Vec::new)
            .push(index);
    }

    fn recent_accesses(&self, duration: Duration) -> impl Iterator<Item = &AccessEntry> {
        let cutoff = Instant::now() - duration;
        self.entries.iter()
            .filter(move |e| e.timestamp > cutoff)
    }

    fn access_pattern(&self, page_id: PageId) -> AccessPattern {
        let indices = self.indices.get(&page_id).unwrap_or(&vec![]);
        let accesses: Vec<_> = indices.iter()
            .map(|&i| &self.entries[i])
            .collect();

        AccessPattern::analyze(&accesses)
    }
}

Algorithms

1. Query Processing

impl InferenceEngine {
    fn query(&mut self, input: &[f32]) -> Result<Vec<f32>> {
        // 1. Hash input to concept address
        let addr = self.storage.hash_address(input);

        // 2. Check if in memory
        let data = match self.memory_mgr.try_load(addr) {
            Some(d) => d,
            None => {
                // 3. Page fault - load from storage
                self.stats.record_miss();
                self.memory_mgr.load_page(addr)?
            }
        };

        // 4. Predict next accesses
        let context = AccessContext::from_current(addr, input);
        let predictions = self.prefetcher.predict(&context);

        // 5. Async prefetch
        for page_id in predictions {
            self.prefetcher.queue_prefetch(page_id);
        }

        // 6. SIMD-accelerated inference
        let output = self.compute_simd(data, input);

        // 7. Update prefetcher
        self.prefetcher.update(addr);

        Ok(output)
    }

    fn compute_simd(&self, weights: &[f32], input: &[f32]) -> Vec<f32> {
        use std::arch::x86_64::*;

        let mut output = vec![0.0f32; weights.len() / input.len()];

        unsafe {
            for (i, chunk) in weights.chunks_exact(input.len()).enumerate() {
                let mut sum = _mm256_setzero_ps();

                for j in (0..input.len()).step_by(8) {
                    let w = _mm256_loadu_ps(&chunk[j]);
                    let x = _mm256_loadu_ps(&input[j]);
                    sum = _mm256_fmadd_ps(w, x, sum);
                }

                // Horizontal sum
                let sum_arr: [f32; 8] = std::mem::transmute(sum);
                output[i] = sum_arr.iter().sum();
            }
        }

        output
    }
}

2. Tier Migration

impl MemoryManager {
    fn migrate_pages(&mut self) {
        // Background task: migrate pages between tiers

        // 1. Identify promotion candidates
        let promote = self.access_log.recent_accesses(Duration::from_secs(60))
            .filter(|e| e.tier != Tier::L1Dram)
            .map(|e| e.page_id)
            .collect::<HashSet<_>>();

        for page_id in promote {
            if let Some(prediction) = self.prefetcher.confidence(page_id) {
                if prediction > 0.8 {
                    self.promote(page_id, Tier::L1Dram)?;
                }
            }
        }

        // 2. Identify demotion candidates
        let demote = self.tiers[Tier::L1Dram]
            .pages()
            .filter(|p| {
                let last_access = Instant::now() - p.last_access;
                last_access > Duration::from_secs(300)
            })
            .map(|p| p.id)
            .collect::<Vec<_>>();

        for page_id in demote {
            self.demote(page_id, Tier::L2Cxl)?;
        }
    }

    fn promote(&mut self, page_id: PageId, target_tier: Tier) -> Result<()> {
        // Load from current tier
        let page = self.load_page(page_id)?;

        // Write to target tier
        self.tiers[target_tier].insert(page_id, page.data.clone())?;

        // Remove from old tier (unless it's persistent storage)
        if page.tier > target_tier {
            self.tiers[page.tier].remove(page_id)?;
        }

        self.stats.record_promotion(page.tier, target_tier);
        Ok(())
    }
}

3. Prefetch Execution

impl PrefetchPredictor {
    fn run_prefetch_loop(&mut self) {
        loop {
            // 1. Get next prediction
            let page_id = self.prefetch_queue.pop();

            // 2. Check if already in fast tier
            if self.memory_mgr.is_in_tier(page_id, Tier::L1Dram) {
                continue;
            }

            // 3. Async load
            let handle = self.async_load(page_id);

            // 4. When complete, promote to L1
            self.pending_prefetches.push((page_id, handle));
        }
    }

    fn async_load(&self, page_id: PageId) -> JoinHandle<Vec<f32>> {
        let storage = self.storage.clone();
        std::thread::spawn(move || {
            storage.read_page(page_id).unwrap()
        })
    }
}

Performance Model

Latency Budget

Target: 1 ms end-to-end query latency

Operation Latency Budget %
Hash address 100 ns 0.01%
L1 DRAM hit 80 ns 0.008%
L2 CXL hit 350 ns 0.035%
L3 SSD hit (prefetched) 80 μs 8%
L4 HDD hit (cold miss) 10 ms 1000%
SIMD inference 500 μs 50%
Prefetch prediction 50 μs 5%
Misc overhead 200 μs 20%

Total (95% L1 hit rate):

  • 95% × 80 ns = 76 ns
  • 4% × 350 ns = 14 ns
  • 1% × 80 μs = 800 ns
  • Inference: 500 μs
  • Total: ~500 μs

Total (with 2.4% L3 miss):

  • 97.6% × 80 ns = 78 ns
  • 2% × 350 ns = 7 ns
  • 0.4% × 80 μs = 320 ns
  • Inference: 500 μs
  • Total: ~500 μs

Throughput Model

Single-threaded:

  • Queries per second: 1 / 500 μs = 2000 QPS

Multi-threaded (16 cores):

  • Queries per second: 2000 × 16 = 32,000 QPS

Batched (batch size 100):

  • Amortize overhead: 200 μs / 100 = 2 μs per query
  • SIMD benefits: 500 μs → 50 μs per query (10× parallelism)
  • Total: ~130 μs per query → 7,700 QPS per core123,000 QPS (16 cores)

Capacity Model

Tier Capacity Active Pages Page Size Total
L1 64 GB 16K 4 MB 64 GB
L2 512 GB 128K 4 MB 512 GB
L3 4 TB 1M 4 MB 4 TB
L4 1 PB 256M 4 MB 1 PB

Total Virtual Address Space: 2^64 bytes = 16 EB

Energy Model

Power Consumption:

Component Idle Active Average (50% util)
CPU (16 cores) 50 W 200 W 125 W
DRAM (64 GB) 20 W 40 W 30 W
CXL (512 GB) 30 W 60 W 45 W
SSD (10×) 50 W 150 W 100 W
HDD (20×) 40 W 100 W 70 W
Total 190 W 550 W 370 W

vs. All-DRAM (1 PB):

  • 1 PB DRAM: ~300 kW (infeasible)
  • DPNC: ~370 W (800× reduction)

Implementation Plan

Phase 1: Foundation (2 weeks)

Week 1: Core data structures

  • MmapNeuralField implementation
  • Page and PageMetadata
  • AccessLog ring buffer
  • Basic hash encoding

Week 2: Memory management

  • MemoryManager with 2 tiers (DRAM, SSD)
  • LRU eviction
  • Sync page load
  • Unit tests

Deliverable: Can mmap 10 GB neural field, load pages on demand


Phase 2: Intelligence (2 weeks)

Week 3: Prefetch predictor

  • Hoeffding Tree implementation
  • Feature extraction
  • Streaming updates
  • Accuracy tracking

Week 4: Async prefetching

  • Prefetch queue
  • Async I/O with tokio
  • Integration with memory manager
  • Benchmarks

Deliverable: 95%+ prefetch accuracy on synthetic workload


Phase 3: Optimization (2 weeks)

Week 5: SIMD acceleration

  • AVX-512 kernels for matmul
  • Zero-copy mmap access
  • Benchmark vs. baseline
  • Profiling and tuning

Week 6: Multi-tier

  • Add L2 (CXL or simulated)
  • Add L4 (HDD)
  • Tier migration policies
  • End-to-end benchmarks

Deliverable: 8× SIMD speedup, <500 μs query latency


Phase 4: Scale (2 weeks)

Week 7: Petabyte scale

  • Sparse hash addressing
  • Multi-SSD parallelism (10× SSDs)
  • Continuous learning for 1 week (24/7)
  • Stability testing

Week 8: Production hardening

  • Error handling
  • Crash recovery
  • Monitoring/metrics
  • Documentation

Deliverable: 1 PB virtual space, robust production system


Success Metrics

Metric Target Measurement
Virtual Capacity 1 PB Virtual address space size
Physical Footprint 64 GB DRAM + 4 TB SSD Actual allocation
Query Latency (p50) <500 μs Histogram
Query Latency (p99) <5 ms Histogram
Prefetch Accuracy >95% Hits / Total
Throughput >10K QPS Queries per second
Energy <400 W Power meter
SIMD Speedup >5× vs. scalar baseline

Conclusion

This architecture synthesizes cutting-edge techniques from systems, ML, and hardware to achieve petabyte-scale continuous cognition. The design is implementable today with commodity hardware (NVMe SSDs, DRAM, CPUs with AVX-512).

Key Innovations:

  1. Memory-mapped neural fields for zero-copy access
  2. Multi-tier hierarchy mirroring human memory
  3. Predictive prefetching with streaming ML
  4. SIMD-accelerated inference on mmap'd data

Expected Outcome: A working system demonstrating <1 ms retrieval from 1 PB knowledge manifold.


Architecture designed: 2025-12-04 Target: Production deployment 2026-Q2