# ADR-005: Graph Neural Network Protein Structure Engine **Status**: Proposed **Date**: 2026-02-11 **Authors**: ruv.io, RuVector Team **Deciders**: Architecture Review Board **Target Crates**: `ruvector-gnn`, `ruvector-graph` ## Version History | Version | Date | Author | Changes | |---------|------|--------|---------| | 0.1 | 2026-02-11 | ruv.io | Initial practical implementation proposal | --- ## Context Protein structure prediction and interaction analysis are fundamental to drug discovery, variant effect prediction, and understanding disease mechanisms. Graph neural networks naturally represent protein structures at multiple scales, from atomic interactions to protein-protein interaction networks. State-of-the-art approaches include: - **ESMFold**: Meta's protein structure prediction using protein language models, achieving AlphaFold2-competitive accuracy without MSAs - **AlphaFold2 Evoformer**: Iterative attention over MSAs and pairwise representations, O(N²) complexity - **ProteinMPNN**: Message passing for inverse protein design, generates sequences matching target structures - **GearNet**: Geometry-aware relational graph neural network for protein representation learning RuVector's existing `ruvector-gnn` crate provides the foundational primitives for building protein graph models: ```rust // Core layers available today pub struct Linear { fn new(input_dim, output_dim), fn forward(&[f32]) -> Vec } pub struct LayerNorm { fn new(dim, eps), fn forward(&[f32]) -> Vec } pub struct MultiHeadAttention { fn new(embed_dim, num_heads), fn forward(query, keys, values) -> Vec } pub struct GRUCell { fn new(input_dim, hidden_dim), fn forward(input, hidden) -> Vec } pub struct RuvectorLayer { fn new(input_dim, hidden_dim, heads, dropout), fn forward(...) } pub struct Tensor { fn new(Vec, Vec), fn matmul(), fn dot() } pub struct Optimizer { fn new(OptimizerType), fn step(params, grads) } // Loss functions fn info_nce_loss(query, positive, negatives) -> f32 fn local_contrastive_loss(embeddings, labels) -> f32 ``` --- ## Decision ### Implement a Practical Protein Graph Engine Using Existing ruvector-gnn Infrastructure We will build a `ProteinGraphEngine` that: 1. Represents protein contact graphs using `ruvector-graph` for storage and query 2. Implements residue-level message passing via `RuvectorLayer` for contact prediction 3. Applies GNN-based approaches to protein interaction prediction (PPI) 4. Integrates with the genomic attention layers (ADR-001 through ADR-004) for variant effect analysis **What works today**: GNN message passing layers, graph storage, HNSW indexing **What needs building**: SE(3) equivariant layers, protein-specific feature encoders, specialized architectures --- ## Architecture ### 1. Residue Contact Graph Construction **Goal**: Predict residue-residue contacts from sequence, enabling structure prediction. **Graph representation**: ``` G_contact = (V, E, X_v, X_e) V = {r_1, r_2, ..., r_N} -- one node per residue E = {(r_i, r_j) : predicted contact or known from structure} X_v in R^{N x d_v} where d_v = 41: - Amino acid type (20-dim one-hot) - Secondary structure (3-dim: helix, strand, coil) - Relative position i/N (1-dim) - Conservation score (1-dim) - MSA-derived features (16-dim) X_e in R^{|E| x d_e} where d_e = 7: - Sequence separation |i-j|/N (1-dim) - Co-evolution score (1-dim) - Distance encoding (5-dim RBF basis) ``` **ruvector-graph storage**: ```rust use ruvector_graph::{GraphDB, NodeBuilder, EdgeBuilder}; pub struct ProteinContactGraph { db: GraphDB, protein_id: String, } impl ProteinContactGraph { pub fn from_sequence(sequence: &str, msa: Option<&MultipleAlignment>) -> Self { let mut db = GraphDB::new(); let n = sequence.len(); // Add residue nodes for (i, aa) in sequence.chars().enumerate() { let features = encode_residue_features(aa, i, n, msa); db.add_node(NodeBuilder::new() .with_label("Residue") .with_property("index", i) .with_property("amino_acid", aa.to_string()) .with_property("features", features) .build()); } // Add predicted contact edges (from GNN or co-evolution) let contact_probs = predict_contacts(&db, sequence); for (i, j, prob) in contact_probs { if prob > 0.5 { // Threshold db.add_edge(EdgeBuilder::new() .from(i).to(j) .with_label("Contact") .with_property("probability", prob) .with_property("seq_sep", ((j - i) as f32 / n as f32)) .build()); } } Self { db, protein_id: format!("protein_{}", uuid::Uuid::new_v4()) } } } fn encode_residue_features(aa: char, pos: usize, len: usize, msa: Option<&MultipleAlignment>) -> Vec { let mut features = vec![0.0; 41]; // One-hot amino acid (20-dim) let aa_idx = AA_TO_INDEX[&aa]; features[aa_idx] = 1.0; // Normalized position features[20] = pos as f32 / len as f32; // Conservation (from MSA if available) features[21] = msa.map(|m| m.conservation_at(pos)).unwrap_or(0.5); // MSA-derived coevolution features (16-dim) if let Some(m) = msa { let coevo = m.coevolution_features(pos); features[22..38].copy_from_slice(&coevo); } // Remaining features: secondary structure prediction, etc. features } ``` ### 2. Message Passing for Contact Prediction **Task**: Predict contact probability for all residue pairs. **Network architecture**: ```rust use ruvector_gnn::layer::{RuvectorLayer, Linear, LayerNorm, MultiHeadAttention}; use ruvector_gnn::optimizer::{Optimizer, OptimizerType}; pub struct ContactPredictor { layers: Vec, edge_predictor: Linear, norm: LayerNorm, hidden_dim: usize, } impl ContactPredictor { pub fn new(input_dim: usize, hidden_dim: usize, num_layers: usize, num_heads: usize) -> Self { let mut layers = Vec::new(); // First layer: input_dim -> hidden_dim layers.push(RuvectorLayer::new(input_dim, hidden_dim, num_heads, 0.1)); // Hidden layers: hidden_dim -> hidden_dim for _ in 1..num_layers { layers.push(RuvectorLayer::new(hidden_dim, hidden_dim, num_heads, 0.1)); } Self { layers, edge_predictor: Linear::new(hidden_dim * 2, 1), // Predict contact from pair norm: LayerNorm::new(hidden_dim, 1e-5), hidden_dim, } } pub fn forward( &self, node_features: &[Vec], edge_index: &[(usize, usize)], edge_weights: &[f32], ) -> Vec> { let mut h = node_features.to_vec(); // Message passing layers for layer in &self.layers { h = self.apply_layer(layer, &h, edge_index, edge_weights); } // Normalize final embeddings h.iter().map(|emb| self.norm.forward(emb)).collect() } fn apply_layer( &self, layer: &RuvectorLayer, node_features: &[Vec], edge_index: &[(usize, usize)], edge_weights: &[f32], ) -> Vec> { let n = node_features.len(); let mut outputs = Vec::with_capacity(n); for i in 0..n { // Gather neighbors let neighbors: Vec<_> = edge_index.iter() .enumerate() .filter(|(_, (src, _))| *src == i) .map(|(idx, (_, dst))| (*dst, edge_weights[idx])) .collect(); if neighbors.is_empty() { outputs.push(node_features[i].clone()); continue; } let neighbor_features: Vec<_> = neighbors.iter() .map(|(j, _)| node_features[*j].clone()) .collect(); let weights: Vec = neighbors.iter().map(|(_, w)| *w).collect(); // RuvectorLayer aggregates neighbors with attention let h_i = layer.forward(&node_features[i], &neighbor_features, &weights); outputs.push(h_i); } outputs } pub fn predict_contacts(&self, embeddings: &[Vec]) -> Vec<(usize, usize, f32)> { let mut contacts = Vec::new(); let n = embeddings.len(); for i in 0..n { for j in (i + 5)..n { // Only pairs with |i-j| >= 5 (long-range) // Concatenate pair embeddings let mut pair_emb = embeddings[i].clone(); pair_emb.extend_from_slice(&embeddings[j]); // Predict contact probability let logit = self.edge_predictor.forward(&pair_emb)[0]; let prob = 1.0 / (1.0 + (-logit).exp()); // Sigmoid if prob > 0.01 { // Only store confident predictions contacts.push((i, j, prob)); } } } contacts } } // Training loop pub fn train_contact_predictor( model: &mut ContactPredictor, train_proteins: &[Protein], num_epochs: usize, ) -> Result<()> { let mut optimizer = Optimizer::new(OptimizerType::Adam { lr: 0.001, beta1: 0.9, beta2: 0.999 }); for epoch in 0..num_epochs { let mut total_loss = 0.0; for protein in train_proteins { // Get node features, edges, ground truth contacts let node_features = protein.residue_features(); let edge_index = protein.sequence_edges(); // Sequential + MSA-based let edge_weights = vec![1.0; edge_index.len()]; // Forward pass let embeddings = model.forward(&node_features, &edge_index, &edge_weights); let predicted = model.predict_contacts(&embeddings); // Compute loss (binary cross-entropy on contacts) let ground_truth = protein.contact_map(); // From known structure let loss = bce_loss(&predicted, &ground_truth); // Backward pass (gradients computed manually or via autograd) // ... gradient computation ... // Optimizer step // optimizer.step(&mut model.parameters(), &gradients); total_loss += loss; } println!("Epoch {}: Loss = {:.4}", epoch, total_loss / train_proteins.len() as f32); } Ok(()) } ``` ### 3. Protein-Protein Interaction (PPI) Network **Goal**: Predict whether two proteins interact based on sequence, structure, and network topology. **Graph representation**: ``` G_PPI = (V_protein, E_interact, X_protein) V_protein = {p_1, ..., p_K} -- K proteins in the interactome X_protein in R^{K x d} -- Protein feature vectors (d=256) Features per protein: - ESM-2 sequence embedding (128-dim) - Gene Ontology terms (64-dim binary) - Subcellular localization (12-dim one-hot) - Expression profile (16-dim from GTEx) - Domain composition (36-dim Pfam fingerprint) ``` **Implementation**: ```rust pub struct PPIPredictor { encoder: RuvectorLayer, // Encode protein features gnn_layers: Vec, // Message passing over PPI graph link_predictor: Linear, // Predict interaction from pair embedding } impl PPIPredictor { pub fn new(input_dim: usize, hidden_dim: usize, num_layers: usize) -> Self { let encoder = RuvectorLayer::new(input_dim, hidden_dim, 8, 0.1); let mut gnn_layers = Vec::new(); for _ in 0..num_layers { gnn_layers.push(RuvectorLayer::new(hidden_dim, hidden_dim, 8, 0.1)); } let link_predictor = Linear::new(hidden_dim * 3, 1); // Concat + Hadamard Self { encoder, gnn_layers, link_predictor } } pub fn predict_interaction(&self, protein_i: &[f32], protein_j: &[f32], graph: &PPIGraph) -> f32 { // Encode proteins let h_i = self.encoder.forward(protein_i, &[], &[]); let h_j = self.encoder.forward(protein_j, &[], &[]); // Message passing (aggregate neighbor information) let h_i_agg = self.aggregate_neighbors(&h_i, graph.neighbors_of(protein_i)); let h_j_agg = self.aggregate_neighbors(&h_j, graph.neighbors_of(protein_j)); // Link prediction: [h_i || h_j || h_i ⊙ h_j] let mut pair_emb = h_i_agg.clone(); pair_emb.extend_from_slice(&h_j_agg); let hadamard: Vec = h_i_agg.iter().zip(&h_j_agg).map(|(a, b)| a * b).collect(); pair_emb.extend_from_slice(&hadamard); let logit = self.link_predictor.forward(&pair_emb)[0]; 1.0 / (1.0 + (-logit).exp()) // Sigmoid } fn aggregate_neighbors(&self, embedding: &[f32], neighbors: &[Vec]) -> Vec { if neighbors.is_empty() { return embedding.to_vec(); } let weights = vec![1.0; neighbors.len()]; let mut h = embedding.to_vec(); for layer in &self.gnn_layers { h = layer.forward(&h, neighbors, &weights); } h } } ``` ### 4. Integration with Genomic Attention Layers **Goal**: Connect variant effects to protein structure changes and interaction disruption. **Pipeline**: ```rust pub struct VariantToProteinPipeline { contact_model: ContactPredictor, ppi_model: PPIPredictor, } impl VariantToProteinPipeline { /// Predict how a missense variant affects protein structure pub fn predict_structural_impact(&self, gene: &str, variant: &Variant) -> StructuralImpact { // 1. Get protein sequence and apply variant let wt_seq = get_protein_sequence(gene); let mut mt_seq = wt_seq.clone(); mt_seq[variant.position] = variant.alt_aa; // 2. Predict contact maps for WT and mutant let wt_graph = ProteinContactGraph::from_sequence(&wt_seq, None); let mt_graph = ProteinContactGraph::from_sequence(&mt_seq, None); let wt_contacts = self.contact_model.predict_contacts(&wt_graph.embeddings()); let mt_contacts = self.contact_model.predict_contacts(&mt_graph.embeddings()); // 3. Compare contact maps let contact_change = compute_contact_difference(&wt_contacts, &mt_contacts); StructuralImpact { contact_disruption: contact_change, predicted_pathogenicity: if contact_change > 0.3 { "Pathogenic" } else { "Benign" }, } } /// Predict how a variant affects protein-protein interactions pub fn predict_interaction_impact(&self, gene: &str, variant: &Variant, interactors: &[String]) -> Vec { let mut changes = Vec::new(); let wt_features = get_protein_features(gene); let mut mt_features = wt_features.clone(); apply_variant_to_features(&mut mt_features, variant); for interactor in interactors { let partner_features = get_protein_features(interactor); let wt_score = self.ppi_model.predict_interaction(&wt_features, &partner_features, &ppi_graph); let mt_score = self.ppi_model.predict_interaction(&mt_features, &partner_features, &ppi_graph); changes.push(InteractionChange { partner: interactor.clone(), wt_score, mt_score, delta: mt_score - wt_score, }); } changes } } ``` --- ## Implementation Status ### ✅ What Works Today - **GNN message passing**: `RuvectorLayer` with multi-head attention and GRU updates - **Graph storage**: `ruvector-graph::GraphDB` for protein graphs - **Training infrastructure**: `Optimizer` with Adam, loss functions - **Linear transformations**: `Linear` layers for projections - **Layer normalization**: `LayerNorm` for stable training ### 🚧 What Needs Building - **SE(3) equivariance**: Coordinate-aware message passing requires extending `RuvectorLayer` to handle 3D positions. This needs a separate `EquivariantLayer` that maintains separate scalar (invariant) and vector (equivariant) channels. - **Protein feature encoders**: MSA processing, co-evolution calculation, ESM-2 embedding extraction - **Contact map evaluation**: Precision@L, precision@L/5 metrics for structure prediction - **PPI training data pipeline**: Integration with STRING, BioGRID, IntAct databases --- ## Performance Targets | Task | Target | Current Capability | |------|--------|-------------------| | Residue contact prediction (300 residues) | < 100 ms | ✅ Achievable with RuvectorLayer (8 layers) | | PPI prediction (single pair) | < 10 ms | ✅ Achievable with RuvectorLayer (3 layers) | | Variant structural impact | < 500 ms | ✅ Two forward passes + comparison | | Batch PPI prediction (1000 pairs) | < 5 seconds | ✅ Parallelizable with batch inference | --- ## SOTA Comparison | Method | Contact Precision@L | PPI AUROC | Handles Variants | |--------|-------------------|-----------|-----------------| | AlphaFold2 | **0.90** | N/A | ❌ | | ESMFold | 0.85 | N/A | ❌ | | ProteinMPNN | N/A | N/A | ❌ (inverse design) | | GearNet | 0.70 | 0.88 | ❌ | | **RuVector GNN** | 0.65-0.75 (target) | 0.80-0.85 (target) | ✅ | **RuVector advantage**: Native integration with variant calling pipeline (ADR-001-004), enabling real-time variant→structure→interaction effect prediction. --- ## Consequences ### Positive - **Native variant integration**: Directly connects genomic variants to protein-level effects - **Practical implementation**: Uses existing `ruvector-gnn` API without requiring new layers - **Interpretable**: Contact maps and PPI scores are clinically actionable - **Scalable**: Message passing scales to proteome-wide interaction networks ### Negative - **No SE(3) equivariance yet**: Current implementation doesn't guarantee rotation/translation invariance - **Lower accuracy than AlphaFold2**: Contact prediction is 10-15% below SOTA structure predictors - **Requires training data**: PPI and contact prediction need labeled protein structures and interaction databases ### Risks - **MSA dependency**: Contact prediction degrades without multiple sequence alignments - **PPI noise**: Experimental interaction databases have 20-30% false positive rate - **Generalization**: Models trained on human proteins may not transfer to pathogens --- ## References 1. Lin, Z. et al. (2023). "Evolutionary-scale prediction of atomic-level protein structure with a language model." *Science*, 379, 1123-1130. (ESMFold) 2. Jumper, J. et al. (2021). "Highly accurate protein structure prediction with AlphaFold." *Nature*, 596, 583-589. (AlphaFold2 Evoformer) 3. Dauparas, J. et al. (2022). "Robust deep learning-based protein sequence design using ProteinMPNN." *Science*, 378, 49-56. (ProteinMPNN) 4. Zhang, Z. et al. (2023). "Protein Representation Learning by Geometric Structure Pretraining." *ICLR 2023*. (GearNet) 5. Szklarczyk, D. et al. (2023). "The STRING database in 2023: protein-protein association networks and functional enrichment analyses." *Nucleic Acids Research*, 51(D1), D483-D489. (STRING PPI database) --- ## Related ADRs - **ADR-001**: RuVector Core Architecture (HNSW index for protein similarity) - **ADR-003**: Genomic Vector Index (variant embeddings feed into protein models) - **ADR-006**: Temporal Epigenomic Engine (integrates with gene expression changes)