wifi-densepose/docs/research/latent-space/latent-graph-interplay.md

# Latent Space ↔ Graph Reality Interplay

## Executive Summary

This document explores the fundamental relationship between **high-dimensional latent space** (where embeddings live) and **graph reality** (the actual topology). This interplay is central to GNN effectiveness: we must encode graph structure into latent representations while ensuring latent similarity reflects topological proximity.

**Central Question**: How do we optimally bridge continuous, high-dimensional embedding geometry with discrete, sparse graph topology?

---

## 1. The Two Worlds

### 1.1 Latent Space Characteristics

**Definition**: Continuous, high-dimensional vector space where node embeddings reside

```
Latent Space L: R^d where d ∈ [64, 1024+]
Node embedding: h_v ∈ L
Distance metric: d(h_u, h_v) (cosine, L2, etc.)
```

**Properties**:
- ✓ **Continuous**: Smooth interpolation between points
- ✓ **Dense**: Every point is surrounded by infinitely many neighbors
- ✓ **High-Dimensional**: Curse of dimensionality, but expressive
- ✓ **Metric**: Equipped with distance/similarity function
- ✗ **Isotropic**: May not preserve hierarchical structure
- ✗ **Euclidean Bias**: Most operations assume flat geometry

**Current RuVector Latent Space**:
```rust
// From layer.rs:337-350
RuvectorLayer {
    input_dim: usize,   // Typically 64-256
    hidden_dim: usize,  // Typically 128-512
    // Embeddings live in R^{hidden_dim}
}
```

### 1.2 Graph Reality Characteristics

**Definition**: Discrete topological structure G = (V, E)

```
Graph G:
- Vertices: V = {v_1, ..., v_n}
- Edges: E ⊆ V × V
- Neighborhoods: N(v) = {u : (u,v) ∈ E}
- Topology: Small-world, scale-free, hierarchical, etc.
```

**Properties**:
- ✓ **Discrete**: Finite nodes and edges
- ✓ **Sparse**: |E| << |V|² (typically)
- ✓ **Structured**: Communities, hierarchies, motifs
- ✓ **Relational**: Explicit connections
- ✗ **Non-Metric**: Shortest path not always meaningful
- ✗ **Heterogeneous**: Variable degree, asymmetric

**RuVector Graph (HNSW)**:
```
Hierarchical Navigable Small World:
- Layer 0: Dense graph (M = 16-64 neighbors)
- Layer 1+: Sparse graphs (long-range connections)
- Navigable: Greedy search finds approximate NN
- Small-world: Low diameter, high clustering
```

---

## 2. The Fundamental Tension

### 2.1 Embedding Paradox

**Goal 1**: Preserve graph topology in latent space
```
If (u, v) ∈ E, then ||h_u - h_v|| should be small
```

**Goal 2**: Preserve latent similarity in graph
```
If ||h_u - h_v|| is small, then u and v should be related
```

**Paradox**: These are not equivalent!
- **Graph neighbors** may be semantically different (e.g., bridge edges)
- **Latent neighbors** may not be graph-connected (e.g., same cluster, different components)

### 2.2 Information Bottleneck

```
Graph G ──encode──> Latent h ──decode──> Graph G'
          (GNN)                  (predict edges/nodes)
```

**Bottleneck**: Fixed-dimensional h must compress all information from:
- Node features
- Local topology (ego-net)
- Global structure (communities, paths)
- Edge attributes
- Dynamic patterns

**Trade-off**:
- **High dimensions**: More expressive, but curse of dimensionality
- **Low dimensions**: Efficient, but lossy compression

---

## 3. Manifold Hypothesis for Graphs

### 3.1 Low-Dimensional Manifold

**Hypothesis**: Graph-structured data lies on a low-dimensional manifold embedded in high-dimensional space

```
True data distribution: P_data(h) supported on manifold M ⊂ R^d
where dim(M) << d
```

**Implications**:
1. **Intrinsic Dimensionality**: Effective degrees of freedom much less than d
2. **Local Linearity**: Small neighborhoods approximately Euclidean
3. **Global Curvature**: Manifold may be curved (non-Euclidean)

**Evidence in RuVector**:
- HNSW assumes low intrinsic dimension for efficient search
- Multi-head attention learns multiple "views" of manifold
- Layer normalization assumes local isotropy

### 3.2 Geometric Structure of Graph Embeddings

**Question**: What geometry best represents graphs?

**Option 1: Euclidean (Current)**
```
h ∈ R^d, distance = ||h_u - h_v||_2
```
- ✓ Simple, well-understood
- ✓ Efficient operations (dot products, linear maps)
- ✗ Poor for tree-like structures
- ✗ Exponential capacity limited

**Option 2: Hyperbolic (Poincaré Ball)**
```
h ∈ B^d = {x ∈ R^d : ||x|| < 1}
distance = arcosh(1 + 2||x-y||² / ((1-||x||²)(1-||y||²)))
```
- ✓ **Exponential capacity**: Volume grows exponentially with radius
- ✓ **Natural hierarchies**: Tree embeddings with low distortion
- ✓ **HNSW synergy**: Hierarchical layers naturally hyperbolic
- ✗ More complex operations
- ✗ Numerical instability near boundary

**Option 3: Mixed Curvature (Product Manifolds)**
```
h = (h_euclidean, h_hyperbolic, h_spherical)
Combine different geometries for different aspects
```

### 3.3 Curvature and Graph Structure

**Relationship**:
```
Negative curvature (hyperbolic) ↔ Tree-like, hierarchical
Zero curvature (Euclidean) ↔ Grid-like, regular
Positive curvature (spherical) ↔ Cyclic, clustered
```

**HNSW Topology**:
- **Layer 0**: Locally grid-like (Euclidean)
- **Higher layers**: Tree-like navigation (Hyperbolic)
- **Overall**: Mixed curvature

**Implication**: Single geometry may be suboptimal; consider mixed-curvature embeddings

---

## 4. Encoding: Graph → Latent Space

### 4.1 Message Passing Framework (Current)

**Goal**: Aggregate neighborhood information into node embedding

```
From layer.rs:362-401 (RuvectorLayer.forward):

h_v^{(l+1)} = UPDATE(
    h_v^{(l)},
    AGGREGATE({m_u^{(l)} : u ∈ N(v)}),
    TRANSFORM(h_v^{(l)})
)
```

**Current Pipeline**:
1. **Message**: `m_u = W_msg · h_u`
2. **Attention Aggregate**: `a_v = MultiHeadAttention(h_v, {h_u})`
3. **Weighted Aggregate**: `agg_v = Σ w_uv · m_u`
4. **Combine**: `combined = a_v + agg_v`
5. **Update**: `h'_v = GRU(W_agg · combined, h_v)`
6. **Normalize**: `output = LayerNorm(Dropout(h'_v))`

**What This Encodes**:
- ✓ Local neighborhood structure (1-hop in one layer)
- ✓ Neighbor feature aggregation
- ✓ Temporal dynamics (via GRU)
- ✗ Global structure (requires stacking layers)
- ✗ Structural properties (degree, centrality, etc.)
- ✗ Edge semantics (only weights, not features)

### 4.2 Multi-Hop Information Propagation

**Challenge**: K-layer GNN sees only K-hop neighborhood

```
Receptive field after L layers: d_graph(u, v) ≤ L
```

**For HNSW**:
- Layer 0 average degree: ~50
- Layer 1 average degree: ~10
- Exponential reduction in higher layers

**Trade-off**:
- **Many layers**: Large receptive field, but over-smoothing
- **Few layers**: Localized, but miss global context

**Over-Smoothing Problem**:
```
As L → ∞, all node embeddings converge to the same value:
h_v^{(∞)} → E[h] for all v
```

**Mitigation Strategies**:
1. **Skip Connections**: `h^{(l+1)} = h^{(l)} + GNN^{(l)}(h^{(l)})`
2. **Residual GRU**: Implicit in `h_t = (1-z_t)h_{t-1} + z_t h̃_t`
3. **Jumping Knowledge**: Concatenate all layer outputs
4. **Adaptive Depth**: Learn when to stop propagating

### 4.3 Structural Features Beyond Neighborhoods

**Current limitation**: Only neighbor features, not structural properties

**Missing Encodings**:
1. **Node Degree**: `deg(v) = |N(v)|`
2. **Clustering Coefficient**: `C(v) = |{(u,w) ∈ E : u,w ∈ N(v)}| / (deg(v) choose 2)`
3. **Centrality**: Betweenness, closeness, eigenvector
4. **Community Membership**: Detected clusters
5. **HNSW Layer**: Which layers the node appears in

**Proposed Enhancement**:
```rust
pub struct StructuralFeatures {
    degree: f32,
    clustering_coef: f32,
    hnsw_layers: Vec<usize>,  // Layers this node appears in
    centrality: f32,
}

impl RuvectorLayer {
    fn forward_with_structural(
        &self,
        node_embedding: &[f32],
        neighbor_embeddings: &[Vec<f32>],
        edge_weights: &[f32],
        structural_features: &StructuralFeatures,  // NEW
    ) -> Vec<f32> {
        // Concatenate structural features to embedding
        let augmented = [node_embedding, &structural_features.to_vec()].concat();

        // Proceed with standard forward pass
        // ...
    }
}
```

---

## 5. Decoding: Latent Space → Graph Predictions

### 5.1 Link Prediction

**Goal**: Predict edge existence from embeddings

```
Score(u, v) = f(h_u, h_v)
P((u,v) ∈ E) = σ(Score(u, v))
```

**Scoring Functions**:

**1. Dot Product** (Current in search.rs)
```rust
score = h_u.dot(h_v)
```
- ✓ Fast O(d)
- ✗ Not invariant to scaling

**2. Cosine Similarity** (Current in search.rs)
```rust
score = h_u.dot(h_v) / (||h_u|| · ||h_v||)
```
- ✓ Scale-invariant
- ✓ Natural for normalized embeddings
- ✗ Ignores magnitude information

**3. Distance-Based**
```rust
score = -||h_u - h_v||²
```
- ✓ Metric structure
- ✗ Negative, unbounded

**4. Bilinear**
```rust
score = h_u^T W h_v
```
- ✓ Learnable asymmetry
- ✗ O(d²) parameters

**5. MLP (Most Expressive)**
```rust
score = MLP([h_u || h_v || (h_u ⊙ h_v)])
```
- ✓ Highly expressive
- ✗ Expensive O(d²) or more

**RuVector Current**:
```rust
// From search.rs:4-18
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot_product: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let norm_a: f32 = (a.iter().map(|&x| (x as f64) * (x as f64)).sum::<f64>().sqrt()) as f32;
    let norm_b: f32 = (b.iter().map(|&x| (x as f64) * (x as f64)).sum::<f64>().sqrt()) as f32;

    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot_product / (norm_a * norm_b)
    }
}
```

### 5.2 Node Classification

**Goal**: Predict node labels from embeddings

```
From graph_neural.rs:82-98:
classify_node(h_v) → class probabilities
```

**Typical Approach**:
```
logits = W_class · h_v + b
probs = softmax(logits)
```

**Challenge**: Embedding must encode label-relevant information

### 5.3 Graph Reconstruction

**Goal**: Reconstruct adjacency matrix from embeddings

**Autoencoder Framework**:
```
Encoder: A → H (GNN)
Decoder: H → A' (pairwise scoring)

Loss: ||A - A'||² or Cross-Entropy
```

**Reconstruction Loss**:
```rust
// Proposed
pub fn graph_reconstruction_loss(
    embeddings: &[Vec<f32>],
    adjacency: &[(usize, usize)],  // True edges
) -> f32 {
    let mut loss = 0.0;
    let n = embeddings.len();

    // Positive edges (should have high score)
    for &(i, j) in adjacency {
        let score = cosine_similarity(&embeddings[i], &embeddings[j]);
        loss -= (score + 1e-10).ln();  // -log(score)
    }

    // Negative sampling (non-edges should have low score)
    for _ in 0..adjacency.len() {
        let i = rand::random::<usize>() % n;
        let j = rand::random::<usize>() % n;
        if !adjacency.contains(&(i, j)) {
            let score = cosine_similarity(&embeddings[i], &embeddings[j]);
            loss -= (1.0 - score + 1e-10).ln();  // -log(1 - score)
        }
    }

    loss / (2 * adjacency.len()) as f32
}
```

---

## 6. Information-Theoretic Perspective

### 6.1 Mutual Information

**Goal**: Maximize mutual information between graph structure G and embeddings H

```
max I(G; H) = H(G) - H(G|H)
            = H(H) - H(H|G)
```

**Interpretation**:
- `I(G; H)` measures how much knowing H tells us about G
- Perfect encoding: `I(G; H) = H(G)` (H captures all graph info)
- Independence: `I(G; H) = 0` (H tells nothing about G)

**Challenges**:
1. **Intractability**: Computing I(G; H) is hard
2. **Continuous H**: Differential entropy unbounded
3. **Discrete G**: Entropy depends on graph size

### 6.2 Deep Graph Infomax (DGI)

**Idea**: Maximize MI between node embeddings and graph summary

```
DGI Loss:
  max I(h_v; h_G)

where:
  h_v: node embedding
  h_G: graph-level summary (e.g., mean pooling)
```

**Implementation**:
```rust
pub fn deep_graph_infomax_loss(
    node_embeddings: &[Vec<f32>],
    graph_summary: &[f32],  // Readout function output
    negative_samples: &[Vec<f32>],  // Corrupted embeddings
) -> f32 {
    let mut loss = 0.0;

    // Positive samples: real (node, graph) pairs
    for h_v in node_embeddings {
        let score = discriminator(h_v, graph_summary);  // MLP or bilinear
        loss -= (sigmoid(score) + 1e-10).ln();
    }

    // Negative samples: (corrupted node, graph) pairs
    for h_neg in negative_samples {
        let score = discriminator(h_neg, graph_summary);
        loss -= (1.0 - sigmoid(score) + 1e-10).ln();
    }

    loss / (node_embeddings.len() + negative_samples.len()) as f32
}
```

**Readout Functions** (graph summary):
1. **Mean**: `h_G = (1/n) Σ_v h_v`
2. **Max**: `h_G = max_v h_v` (element-wise)
3. **Attention**: `h_G = Σ_v α_v h_v` where `α_v = softmax(MLP(h_v))`

### 6.3 Information Bottleneck Principle

**Principle**: Find minimal sufficient representation

```
min I(X; H) - β I(H; Y)

where:
  X: input features
  H: learned embeddings
  Y: prediction target
  β: trade-off parameter
```

**Graph Context**:
- `X`: Node features + neighborhood structure
- `H`: Node embeddings
- `Y`: Downstream task (link, classification)

**Goal**: Compress X into H, retaining only task-relevant information

**Implementation Strategy**:
1. **Variational Bound**: Use VAE-style reparameterization
2. **Lagrange Multiplier**: β controls compression vs. performance
3. **Regularization**: Encourage low mutual information I(X; H)

---

## 7. Contrastive Learning for Graph-Latent Alignment

### 7.1 Contrastive Objectives

**Core Idea**: Pull together related nodes in latent space, push apart unrelated nodes

**InfoNCE Loss** (Current in training.rs:362-411):
```rust
pub fn info_nce_loss(
    anchor: &[f32],
    positives: &[&[f32]],
    negatives: &[&[f32]],
    temperature: f32,
) -> f32
```

**Mathematical Form**:
```
L_InfoNCE = -log(exp(sim(h_v, h_+) / τ) / (exp(sim(h_v, h_+) / τ) + Σ_{h_- ∈ N} exp(sim(h_v, h_-) / τ)))
```

**What This Optimizes**:
- Positive pairs `(h_v, h_+)`: Graph neighbors, semantically similar
- Negative pairs `(h_v, h_-)`: Non-neighbors, dissimilar
- Temperature τ: Controls hardness of negatives

### 7.2 Local Contrastive Loss (Graph-Specific)

**Current Implementation** (training.rs:444-462):
```rust
pub fn local_contrastive_loss(
    node_embedding: &[f32],
    neighbor_embeddings: &[Vec<f32>],
    non_neighbor_embeddings: &[Vec<f32>],
    temperature: f32,
) -> f32
```

**Graph-Aware Sampling**:
- **Positives**: Direct graph neighbors `N(v)`
- **Negatives**: Non-neighbors (random or hard negatives)

**Variants**:

**1. K-Hop Positives**
```
Positives = {u : d_graph(v, u) ≤ K}
Encourages multi-hop proximity in latent space
```

**2. Community-Based**
```
Positives = {u : community(v) = community(u)}
Negatives = {u : community(v) ≠ community(u)}
Encourages cluster separation
```

**3. HNSW Layer-Based**
```
Positives = {u : (u,v) ∈ E_layer_k}
Different contrastive losses per HNSW layer
```

### 7.3 Hard Negative Mining

**Problem**: Random negatives are often too easy

**Solution**: Sample hard negatives (latent-close but graph-far)

```rust
fn sample_hard_negatives(
    node: &[f32],
    all_embeddings: &[Vec<f32>],
    true_neighbors: &[usize],
    k: usize,
) -> Vec<Vec<f32>> {
    // 1. Compute similarities to all nodes
    let similarities: Vec<(usize, f32)> = all_embeddings
        .iter()
        .enumerate()
        .filter(|(i, _)| !true_neighbors.contains(i))  // Exclude true neighbors
        .map(|(i, emb)| (i, cosine_similarity(node, emb)))
        .collect();

    // 2. Sort by similarity (descending)
    similarities.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());

    // 3. Take top-k (most similar non-neighbors = hard negatives)
    similarities.iter()
        .take(k)
        .map(|(i, _)| all_embeddings[*i].clone())
        .collect()
}
```

**Benefits**:
- Focuses learning on difficult cases
- Improves discrimination boundaries
- Speeds up convergence

---

## 8. Spectral Methods and Graph Signals

### 8.1 Graph Laplacian

**Normalized Laplacian**:
```
L = I - D^(-1/2) A D^(-1/2)

where:
  A: adjacency matrix
  D: degree matrix (diagonal)
```

**Eigendecomposition**:
```
L = U Λ U^T

where:
  U: eigenvectors (graph Fourier basis)
  Λ: eigenvalues (frequencies)
```

**Interpretation**:
- Small eigenvalues ↔ Low-frequency (smooth signals)
- Large eigenvalues ↔ High-frequency (oscillatory signals)

### 8.2 Spectral GNN Connection

**Classical GCN** (Kipf & Welling):
```
H^{(l+1)} = σ(D̃^(-1/2) Ã D̃^(-1/2) H^{(l)} W^{(l)})

where Ã = A + I (self-loops)
```

**Spectral Interpretation**:
- Aggregation = Low-pass filter
- Smooths node features along graph structure
- Eigenvalues control extent of smoothing

**RuVector's Approach** (Spatial, not spectral):
- Message passing is spatial formulation
- Attention adds adaptive filtering
- GRU adds temporal component

**Missing Spectral Component**:
- No explicit frequency analysis
- Could add spectral loss to preserve frequency content

### 8.3 Spectral Loss Functions

**Goal**: Preserve graph spectral properties in embeddings

**Laplacian Eigenmaps**:
```
min_H Σ_{(i,j) ∈ E} ||h_i - h_j||²
subject to H^T H = I

Equivalent to minimizing: Tr(H^T L H)
```

**Implementation**:
```rust
pub fn spectral_loss(
    embeddings: &[Vec<f32>],
    adjacency: &[(usize, usize)],
    degrees: &[f32],
) -> f32 {
    let mut loss = 0.0;

    // Laplacian regularization: ||h_i - h_j||² for edges
    for &(i, j) in adjacency {
        let diff = subtract(&embeddings[i], &embeddings[j]);
        let norm_sq = l2_norm_squared(&diff);

        // Normalize by degrees
        let weight = 1.0 / (degrees[i].sqrt() * degrees[j].sqrt());
        loss += weight * norm_sq;
    }

    loss
}
```

**Benefits**:
- Smooth embeddings along graph structure
- Preserves community structure
- Theoretical guarantees (Laplacian eigenmaps)

---

## 9. Disentangled Representations

### 9.1 Motivation

**Problem**: Current embeddings are entangled (single vector encodes everything)

**Goal**: Separate embedding into interpretable factors

```
h_v = [h_structural || h_semantic || h_temporal]

where:
  h_structural: Topology (degree, centrality, etc.)
  h_semantic: Feature content
  h_temporal: Dynamics (for evolving graphs)
```

### 9.2 β-VAE for Graphs

**Variational Autoencoder with Disentanglement**:
```
Encoder: (X_v, N(v)) → q(z_v | X_v, N(v))
Decoder: z_v → p(X_v, N(v) | z_v)

Loss: L_VAE = E[log p(X_v | z_v)] - β KL(q(z_v) || p(z_v))
```

**β > 1**: Encourages disentanglement (independence of latent factors)

**Implementation Sketch**:
```rust
pub struct GraphVAE {
    encoder: RuvectorLayer,
    mu_layer: Linear,
    logvar_layer: Linear,
    decoder: Linear,
}

impl GraphVAE {
    fn encode(&self, node_features: &[f32], neighbors: &[Vec<f32>]) -> (Vec<f32>, Vec<f32>) {
        let h = self.encoder.forward(node_features, neighbors, &[]);
        let mu = self.mu_layer.forward(&h);
        let logvar = self.logvar_layer.forward(&h);
        (mu, logvar)
    }

    fn reparameterize(&self, mu: &[f32], logvar: &[f32]) -> Vec<f32> {
        let std: Vec<f32> = logvar.iter().map(|&lv| (lv / 2.0).exp()).collect();
        let eps: Vec<f32> = (0..mu.len()).map(|_| rand::thread_rng().sample(StandardNormal)).collect();

        mu.iter().zip(std.iter()).zip(eps.iter())
            .map(|((&m, &s), &e)| m + s * e)
            .collect()
    }

    fn forward(&self, node_features: &[f32], neighbors: &[Vec<f32>]) -> (Vec<f32>, f32) {
        let (mu, logvar) = self.encode(node_features, neighbors);
        let z = self.reparameterize(&mu, &logvar);

        // Reconstruct node features
        let recon = self.decoder.forward(&z);

        // KL divergence
        let kl: f32 = mu.iter().zip(logvar.iter())
            .map(|(&m, &lv)| -0.5 * (1.0 + lv - m*m - lv.exp()))
            .sum();

        (recon, kl)
    }
}
```

### 9.3 Disentanglement Metrics

**1. Mutual Information Gap (MIG)**
```
MIG(z, y) = (1/K) Σ_k (I(z; y_k)_largest - I(z; y_k)_2nd_largest) / H(y_k)

Measures how uniquely each latent factor captures each ground-truth factor
```

**2. SAP (Separated Attribute Predictability)**
```
Train linear classifiers z → y for each attribute
Measure how well z predicts individual factors
```

**Application to Graphs**:
- Ground-truth factors: Degree, clustering, centrality, community
- Learned latent: h_v
- Metric: MIG or SAP between h_v components and structural properties

---

## 10. Hierarchical Representations (HNSW-Specific)

### 10.1 Multi-Scale Embeddings

**Idea**: Different embeddings for different HNSW layers

```
Node v appears in layers {0, 2, 3}:
  h_v^{(0)}: Dense, local structure
  h_v^{(2)}: Coarse, medium-range
  h_v^{(3)}: Global, long-range hubs
```

**Hierarchical Encoding**:
```rust
pub struct HierarchicalEmbedding {
    embeddings_by_layer: HashMap<usize, Vec<f32>>,
}

impl HierarchicalEmbedding {
    fn get_embedding(&self, layer: usize) -> &Vec<f32> {
        self.embeddings_by_layer.get(&layer)
            .expect("Node not in this layer")
    }

    // Interpolate between layers for search
    fn interpolated_embedding(&self, target_layer: f32) -> Vec<f32> {
        let layer_low = target_layer.floor() as usize;
        let layer_high = target_layer.ceil() as usize;

        if layer_low == layer_high {
            return self.get_embedding(layer_low).clone();
        }

        let alpha = target_layer - layer_low as f32;
        let emb_low = self.get_embedding(layer_low);
        let emb_high = self.get_embedding(layer_high);

        // Linear interpolation
        emb_low.iter().zip(emb_high.iter())
            .map(|(&l, &h)| (1.0 - alpha) * l + alpha * h)
            .collect()
    }
}
```

**Hierarchical Loss**:
```rust
fn hierarchical_contrastive_loss(
    node_hierarchical_emb: &HierarchicalEmbedding,
    neighbors_by_layer: &HashMap<usize, Vec<HierarchicalEmbedding>>,
) -> f32 {
    let mut loss = 0.0;

    // Contrastive loss at each layer
    for (layer, layer_neighbors) in neighbors_by_layer {
        let h_v = node_hierarchical_emb.get_embedding(*layer);
        let positives: Vec<&Vec<f32>> = layer_neighbors.iter()
            .map(|n| n.get_embedding(*layer))
            .collect();

        // Sample negatives from other layers
        let negatives = sample_negatives_other_layers(neighbors_by_layer, *layer);

        loss += info_nce_loss(h_v, &positives, &negatives, 0.07);
    }

    loss / neighbors_by_layer.len() as f32
}
```

### 10.2 Coarse-to-Fine Alignment

**Goal**: Ensure consistency across HNSW layers

```
Alignment Loss:
  L_align = Σ_v Σ_{l < l'} ||h_v^{(l)} - Project(h_v^{(l')})||²

where Project: R^{d_high} → R^{d_low} (e.g., learned linear map)
```

**Benefits**:
- Global structure (high layers) guides local (low layers)
- Enables layer-skipping (jump from layer 3 to layer 0 embedding)
- Multi-resolution representation

---

## 11. Practical Strategies for RuVector

### 11.1 Short-Term Enhancements

**1. Structural Feature Augmentation**
```rust
// Add degree, clustering, HNSW layer info to embeddings
let augmented_embedding = [
    &node_embedding[..],
    &[degree as f32],
    &[clustering_coef],
    &one_hot_layer[..],
].concat();
```

**2. Spectral Regularization**
```rust
// Add spectral loss to training
total_loss = contrastive_loss + λ_spectral * spectral_loss
```

**3. Hard Negative Sampling**
```rust
// Replace random negatives with hard negatives in local_contrastive_loss
let hard_negatives = sample_hard_negatives(node, all_embeddings, neighbors, k);
let loss = info_nce_loss(node, &neighbors, &hard_negatives, temperature);
```

### 11.2 Medium-Term Research

**4. Hierarchical Embeddings per HNSW Layer**
```rust
pub struct HNSWHierarchicalGNN {
    gnn_layers_by_hnsw_level: Vec<RuvectorLayer>,
}
```

**5. Hyperbolic Embeddings for Higher Layers**
```rust
// Layer 0: Euclidean (local, grid-like)
// Layer 1+: Hyperbolic (hierarchical navigation)
pub enum GeometricEmbedding {
    Euclidean(Vec<f32>),
    Hyperbolic(Vec<f32>),  // Poincaré ball
}
```

**6. Disentangled VAE**
```rust
// Separate structural vs. semantic information
pub struct DisentangledGraphVAE {
    structural_encoder: RuvectorLayer,
    semantic_encoder: RuvectorLayer,
    decoder: Linear,
}
```

### 11.3 Long-Term Exploration

**7. Information Bottleneck Optimization**
- Minimize I(X; H) while maximizing I(H; Y)
- Variational bounds for tractability
- Beta-annealing schedule

**8. Graph Transformers**
- Replace message passing with full attention
- Positional encodings (Laplacian eigenvectors, RoPE)
- Layer-wise multi-scale attention

**9. Neural ODEs for Continuous Depth**
```
dh/dt = GNN(h(t), G)
h(T) = h(0) + ∫₀^T GNN(h(t), G) dt
```

---

## 12. Evaluation Metrics for Latent-Graph Alignment

### 12.1 Reconstruction Metrics

**1. Link Prediction AUC**
```
Measure how well latent similarity predicts edges
AUC-ROC on link prediction task
```

**2. Graph Reconstruction Error**
```
||A - σ(H H^T)||²_F
where A is adjacency, H is embeddings
```

### 12.2 Structural Preservation

**3. Rank Correlation**
```
Spearman ρ between:
  - Graph distance d_G(u, v)
  - Latent distance d_L(h_u, h_v)
```

**4. Distortion**
```
max_{u,v} |d_L(h_u, h_v) - d_G(u, v)|
Worst-case embedding distortion
```

**5. Average Distortion**
```
(1/|V|²) Σ_{u,v} |d_L(h_u, h_v) - d_G(u, v)|
```

### 12.3 Downstream Task Performance

**6. Node Classification Accuracy**
```
Train classifier on embeddings, test accuracy
```

**7. Clustering Modularity**
```
K-means on embeddings, measure graph modularity
```

**8. HNSW Search Quality**
```
Recall@K using learned embeddings vs. original features
```

---

## References

### Papers

**Manifold Learning**:
1. Tenenbaum et al. (2000) - A Global Geometric Framework for Nonlinear Dimensionality Reduction (Isomap)
2. Belkin & Niyogi (2003) - Laplacian Eigenmaps for Dimensionality Reduction

**Hyperbolic Embeddings**:
3. Nickel & Kiela (2017) - Poincaré Embeddings for Learning Hierarchical Representations
4. Chami et al. (2019) - Hyperbolic Graph Convolutional Neural Networks

**Information Theory**:
5. Tishby & Zaslavsky (2015) - Deep Learning and the Information Bottleneck Principle
6. Velickovic et al. (2019) - Deep Graph Infomax

**Contrastive Learning**:
7. Chen et al. (2020) - A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)
8. You et al. (2020) - Graph Contrastive Learning with Augmentations

**Disentanglement**:
9. Higgins et al. (2017) - β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework
10. Ma et al. (2019) - Disentangled Graph Convolutional Networks

### RuVector Code
- `crates/ruvector-gnn/src/layer.rs` - GNN encoding
- `crates/ruvector-gnn/src/search.rs` - Latent similarity (decoding)
- `crates/ruvector-gnn/src/training.rs` - Contrastive losses (alignment)

---

**Document Version**: 1.0
**Last Updated**: 2025-11-30
**Author**: RuVector Research Team