# Neuro-Symbolic Query Execution - Implementation Plan

## Overview

### Problem Statement

Current vector search in ruvector is purely neural (similarity-based): given a query vector, find the k most similar vectors by cosine/Euclidean distance. However, real-world queries often involve **logical constraints** that pure vector similarity cannot express.

**Examples of Unsupported Queries:**

- "Find vectors similar to X **AND** published after 2023 **AND** tagged as 'research'"
- "Find vectors similar to X **OR** similar to Y, **EXCLUDING** category 'spam'"
- "Find vectors where `metadata.price < 100` **AND** similarity > 0.8"
- "Find vectors in graph community C **AND** within 2 hops of node N"

**Current Limitations:**

- No support for boolean logic (AND, OR, NOT)
- Cannot filter by metadata attributes
- Cannot combine vector similarity with graph structure
- Forces inefficient post-processing filtering
- No way to express complex multi-modal queries

**Performance Impact:**

- Retrieving 10,000 vectors and then filtering down to 10 wastes 99.9% of the computation
- No index acceleration for metadata predicates
- Filters cannot be pushed down into HNSW search

### Proposed Solution

**Neuro-Symbolic Query Execution**: a hybrid query engine that combines neural vector similarity with symbolic logical constraints.

**Key Components:**

1. **Query Language**: Extend existing Cypher/SQL support with vector similarity operators
2. **Hybrid Scoring**: Combine vector similarity scores with predicate satisfaction
3. **Filter Pushdown**: Apply logical constraints during HNSW search (not after)
4. **Multi-Modal Indexing**: Index metadata attributes alongside vectors
5.
**Constraint Propagation**: Use graph structure to prune the search space

**Architecture:**

```
Query: "MATCH (v:Vector)
        WHERE vector_similarity(v.embedding, $query) > 0.8
          AND v.year >= 2023
          AND v.category IN ['research', 'papers']
        RETURN v ORDER BY similarity DESC LIMIT 10"
                        ↓
                 Parse & Optimize
                        ↓
  Neural Component:               Symbolic Component:
  vector_similarity > 0.8        year >= 2023 AND category IN [...]
          ↓                                 ↓
     HNSW Search                      Metadata Index
          └───────────── Merge ─────────────┘
                        ↓
     Hybrid Scoring (α * neural + β * symbolic)
                        ↓
                  Top-K Results
```

### Expected Benefits

**Quantified Performance Improvements:**

| Query Type | Current (Post-Filter) | Neuro-Symbolic | Improvement |
|------------|----------------------|----------------|-------------|
| Similarity + 1 filter | 50ms (10K retrieved) | 5ms (100 retrieved) | **10x faster** |
| Similarity + 3 filters | 200ms (50K retrieved) | 8ms (200 retrieved) | **25x faster** |
| Complex boolean logic | Not supported | 15ms | **∞** (new capability) |
| Multi-modal query | Manual joins | 20ms | **50x faster** |

**Qualitative Benefits:**

- Express complex queries naturally (no manual post-processing)
- Efficient execution with filter pushdown
- Support for real-world use cases (e-commerce, research, RAG)
- Better accuracy through multi-modal fusion
- Graph-aware queries (community detection, path constraints)

## Technical Design

### Architecture Diagram (ASCII Art)

```
Neuro-Symbolic Query Execution Pipeline

User Query (SQL/Cypher + Vector Similarity)
  Example: "SELECT * FROM vectors
            WHERE cosine_similarity(embedding, $query) > 0.8
              AND category = 'research' AND year >= 2023
            ORDER BY similarity DESC LIMIT 10"
      ↓
Query Parser & AST Builder
  Parse query into Abstract Syntax Tree
  (AST):
    SELECT
    WHERE
      AND
      ├─ cosine_similarity(emb, $q) > 0.8   [NEURAL]
      ├─ category = 'research'              [SYMBOLIC]
      └─ year >= 2023                       [SYMBOLIC]
    ORDER BY similarity DESC
    LIMIT 10
      ↓
Query Optimizer
  Analyze predicates and rewrite query for efficiency
  1. Predicate Pushdown:
     Move filters into HNSW search (before candidate generation)
  2. Index Selection:
     Choose best index for symbolic predicates
       - category: inverted index
       - year: range index (B-tree)
  3. Execution Strategy:
     - If few categories: scan category index first
     - If similarity selective: HNSW first, then filter
     - If balanced: hybrid merge
  4.
Hybrid Scoring:
     score = α * neural_sim + β * symbolic_score
      ↓
Execution Plan
  Step 1: HNSW Search (neural)
    - Target: similarity > 0.8
    - Candidate pool: ef=200
    - Early termination: collect ~100 candidates
    - Filter during search: year >= 2023
    Output: {node_id, similarity} for ~100 candidates
  Step 2: Symbolic Filtering (metadata index)
    - Lookup category index: category = 'research'
    - Intersect with HNSW candidates
    Output: {node_id, similarity, metadata} for ~30 nodes
  Step 3: Hybrid Scoring
    - Compute symbolic_score (e.g., recency bonus)
    - Combined: 0.7 * similarity + 0.3 * symbolic_score
    Output: {node_id, hybrid_score}
  Step 4: Top-K Selection
    - Sort by hybrid_score DESC
    - Return top 10
      ↓
Result Set
  [{id: 42, similarity: 0.95, category: 'research', year: 2024},
   {id: 137, similarity: 0.92, category: 'research', year: 2023},
   ...]
```
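The four-step plan above can be simulated with a tiny standalone sketch. Plain Python dictionaries stand in for the HNSW candidates, the category index, and the recency bonus; all names and numbers are illustrative, not the real ruvector API:

```python
# Hypothetical miniature of the 4-step plan:
# filtered candidates -> index intersection -> hybrid scoring -> top-k.

# Step 1: candidates from HNSW search (node_id -> similarity),
# already filtered by "year >= 2023" during traversal.
candidates = {42: 0.95, 137: 0.92, 7: 0.85, 99: 0.81}

# Step 2: symbolic filtering via a category index (category -> node ids).
category_index = {"research": {42, 137, 7}, "spam": {99}}
survivors = {n: s for n, s in candidates.items() if n in category_index["research"]}

# Step 3: hybrid scoring: 0.7 * similarity + 0.3 * symbolic score (recency bonus).
recency_bonus = {42: 1.0, 137: 0.5, 7: 0.0}
hybrid = {n: 0.7 * s + 0.3 * recency_bonus[n] for n, s in survivors.items()}

# Step 4: top-k selection by hybrid score.
top2 = sorted(hybrid, key=hybrid.get, reverse=True)[:2]
print(top2)  # → [42, 137]
```

Note how node 99 never reaches the scoring step: the symbolic filter removes it before any hybrid score is computed, which is the whole point of pushing filters ahead of expensive work.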
```
Indexing & Storage Architecture

Vector Data:
  HNSW Index (vector similarity)
    - Node ID → Embedding vector
    - Graph structure for approximate NN search

Metadata Data:
  Inverted Index (categorical attributes)
    - category → {node_ids}
    - tag → {node_ids}
    - author → {node_ids}
  B-Tree Index (range attributes)
    - year → sorted {node_ids}
    - price → sorted {node_ids}
    - timestamp → sorted {node_ids}
  Roaring Bitmap Index (set operations)
    - Efficient AND/OR/NOT on node ID sets
    - Compressed storage for sparse sets
  Graph Index (structural constraints)
    - Community membership: community_id → {node_ids}
    - k-hop neighborhoods: precomputed for common queries
    - Path constraints: shortest path caches
```

### Core Data Structures (Rust)

```rust
// File: crates/ruvector-query/src/neuro_symbolic/mod.rs

use std::collections::{HashMap, HashSet};
use serde::{Deserialize, Serialize};

/// Neuro-symbolic query execution engine
pub struct NeuroSymbolicEngine {
    /// HNSW index for vector similarity
    hnsw_index: Arc<HnswIndex>,
    /// Metadata indexes (inverted, B-tree,
    /// etc.)
    metadata_indexes: MetadataIndexes,
    /// Query optimizer
    optimizer: QueryOptimizer,
    /// Execution planner
    planner: ExecutionPlanner,
    /// Hybrid scoring configuration
    scoring_config: HybridScoringConfig,
}

/// Query representation (SQL/Cypher AST)
#[derive(Debug, Clone)]
pub struct Query {
    /// SELECT clause (which fields to return)
    pub select: Vec<String>,
    /// WHERE clause (predicates)
    pub where_clause: Option<Predicate>,
    /// ORDER BY clause
    pub order_by: Vec<OrderBy>,
    /// LIMIT clause
    pub limit: Option<usize>,
    /// OFFSET clause
    pub offset: Option<usize>,
}

/// Predicate tree (boolean logic)
#[derive(Debug, Clone)]
pub enum Predicate {
    /// Neural predicate: vector similarity
    VectorSimilarity {
        field: String,
        query_vector: Vec<f32>,
        operator: ComparisonOp,   // >, <, =
        threshold: f32,
        metric: SimilarityMetric, // cosine, euclidean, dot
    },
    /// Symbolic predicate: metadata constraint
    Attribute {
        field: String,
        operator: ComparisonOp,
        value: Value,
    },
    /// Graph predicate: structural constraint
    Graph {
        constraint: GraphConstraint,
    },
    /// Boolean operators
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
    Not(Box<Predicate>),
}

#[derive(Debug, Clone)]
pub enum GraphConstraint {
    /// Node in community
    InCommunity { community_id: u32 },
    /// Within k hops of node
    WithinKHops { source_node: u32, k: usize },
    /// On path between two nodes
    OnPath { source: u32, target: u32 },
    /// Has edge to node
    ConnectedTo { node_id: u32 },
}

#[derive(Debug, Clone, Copy)]
pub enum ComparisonOp {
    Eq, // =
    Ne, // !=
    Lt, // <
    Le, // <=
    Gt, // >
    Ge, // >=
    In, // IN (...)
    Like, // LIKE (string pattern)
}

#[derive(Debug, Clone)]
pub enum Value {
    Int(i64),
    Float(f64),
    String(String),
    Bool(bool),
    List(Vec<Value>),
}

#[derive(Debug, Clone, Copy)]
pub enum SimilarityMetric {
    Cosine,
    Euclidean,
    DotProduct,
    L1,
}

/// Metadata indexing structures
pub struct MetadataIndexes {
    /// Inverted indexes for categorical fields
    inverted: HashMap<String, InvertedIndex>,
    /// B-tree indexes for range queries
    btree: HashMap<String, BTreeIndex>,
    /// Roaring bitmap for set operations
    bitmap_store: BitmapStore,
    /// Graph structural indexes
    graph_index: GraphStructureIndex,
}

/// Inverted index: field_value → {node_ids}
pub struct InvertedIndex {
    /// Map from value to posting list (node IDs)
    postings: HashMap<Value, RoaringBitmap>,
    /// Statistics for query optimization
    stats: IndexStats,
}

/// B-tree index for range queries
pub struct BTreeIndex {
    /// Sorted map from value to node IDs
    tree: BTreeMap<OrderedValue, RoaringBitmap>,
    /// Statistics
    stats: IndexStats,
}

/// Roaring bitmap store for efficient set operations
pub struct BitmapStore {
    /// Node ID sets as compressed bitmaps
    bitmaps: HashMap<String, RoaringBitmap>,
}

/// Graph structure indexes
pub struct GraphStructureIndex {
    /// Community assignments
    communities: HashMap<u32, RoaringBitmap>,
    /// k-hop neighborhoods (precomputed)
    khop_cache: HashMap<(u32, usize), RoaringBitmap>,
    /// Shortest path cache
    path_cache: PathCache,
}

#[derive(Debug, Default)]
pub struct IndexStats {
    pub num_unique_values: usize,
    pub total_postings: usize,
    pub avg_posting_length: f64,
    pub selectivity: f64, // fraction of nodes matching
}

/// Query execution plan
#[derive(Debug)]
pub struct ExecutionPlan {
    /// Ordered steps to execute
    pub steps: Vec<ExecutionStep>,
    /// Estimated cost
    pub estimated_cost: f64,
    /// Estimated result size
    pub estimated_results: usize,
}

#[derive(Debug)]
pub enum ExecutionStep {
    /// HNSW vector search
    VectorSearch {
        query_vector: Vec<f32>,
        similarity_threshold: f32,
        metric: SimilarityMetric,
        ef: usize,
        filters: Vec<InlineFilter>, // Filters applied during search
    },
    /// Metadata index lookup
    IndexScan {
        index_name: String,
        predicate: Predicate,
    },
    /// Graph structure
    /// traversal
    GraphTraversal {
        constraint: GraphConstraint,
    },
    /// Set intersection (AND)
    Intersect {
        left: Box<ExecutionStep>,
        right: Box<ExecutionStep>,
    },
    /// Set union (OR)
    Union {
        left: Box<ExecutionStep>,
        right: Box<ExecutionStep>,
    },
    /// Set difference (NOT)
    Difference {
        left: Box<ExecutionStep>,
        right: Box<ExecutionStep>,
    },
    /// Hybrid scoring
    HybridScore {
        neural_scores: HashMap<u32, f32>,
        symbolic_scores: HashMap<u32, f32>,
        alpha: f32, // neural weight
        beta: f32,  // symbolic weight
    },
    /// Top-K selection
    TopK {
        input: Box<ExecutionStep>,
        k: usize,
        order_by: Vec<OrderBy>,
    },
}

/// Filter applied during HNSW search (pushdown)
#[derive(Debug, Clone)]
pub struct InlineFilter {
    pub field: String,
    pub operator: ComparisonOp,
    pub value: Value,
}

/// Hybrid scoring configuration
#[derive(Debug, Clone)]
pub struct HybridScoringConfig {
    /// Weight for neural similarity score
    pub neural_weight: f32,
    /// Weight for symbolic score
    pub symbolic_weight: f32,
    /// Normalization method
    pub normalization: NormalizationMethod,
}

#[derive(Debug, Clone, Copy)]
pub enum NormalizationMethod {
    /// Min-max normalization [0, 1]
    MinMax,
    /// Z-score normalization
    ZScore,
    /// None (assume scores already normalized)
    None,
}

/// Query result
#[derive(Debug, Serialize, Deserialize)]
pub struct QueryResult {
    /// Matched node IDs
    pub node_ids: Vec<u32>,
    /// Neural similarity scores
    pub neural_scores: Vec<f32>,
    /// Symbolic scores (if applicable)
    pub symbolic_scores: Option<Vec<f32>>,
    /// Hybrid scores
    pub hybrid_scores: Vec<f32>,
    /// Metadata for each result
    pub metadata: Vec<HashMap<String, Value>>,
    /// Query execution statistics
    pub stats: QueryStats,
}

#[derive(Debug, Serialize, Deserialize, Default)]
pub struct QueryStats {
    /// Total execution time (milliseconds)
    pub total_time_ms: f64,
    /// Time breakdown by step
    pub step_times: Vec<(String, f64)>,
    /// Number of candidates evaluated
    pub candidates_evaluated: usize,
    /// Number of results returned
    pub results_returned: usize,
    /// Index usage
    pub indexes_used: Vec<String>,
}

#[derive(Debug, Clone)]
pub struct OrderBy {
    pub field: String,
    pub direction: SortDirection,
}

#[derive(Debug, Clone, Copy)]
pub enum SortDirection {
    Asc,
    Desc,
}

/// Wrapper for ordered values in B-tree
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum OrderedValue {
    Int(i64),
    Float(OrderedFloat<f64>),
    String(String),
}

use ordered_float::OrderedFloat;
use roaring::RoaringBitmap;
use std::collections::BTreeMap;
use std::sync::Arc;
```

### Key Algorithms (Pseudocode)

#### 1. Query Execution Algorithm

```python
function execute_neuro_symbolic_query(query: Query, engine: NeuroSymbolicEngine) -> QueryResult:
    """
    Execute neuro-symbolic query with hybrid scoring.

    Main algorithm: parse → optimize → plan → execute → score → return
    """
    start_time = now()

    # Step 1: Parse query into AST (already done, query is AST)

    # Step 2: Optimize query (predicate pushdown, index selection)
    optimized_query = engine.optimizer.optimize(query)

    # Step 3: Generate execution plan
    plan = engine.planner.create_plan(optimized_query)

    # Step 4: Execute plan steps
    result_set = execute_plan(plan, engine)

    # Step 5: Hybrid scoring
    if has_both_neural_and_symbolic(plan):
        result_set = apply_hybrid_scoring(result_set, engine.scoring_config)

    # Step 6: Apply ORDER BY and LIMIT
    result_set = sort_and_limit(result_set, query.order_by, query.limit, query.offset)

    # Step 7: Fetch metadata for results
    metadata = fetch_metadata(result_set.node_ids, query.select)

    execution_time = now() - start_time

    return QueryResult(
        node_ids=result_set.node_ids,
        neural_scores=result_set.neural_scores,
        symbolic_scores=result_set.symbolic_scores,
        hybrid_scores=result_set.hybrid_scores,
        metadata=metadata,
        stats=QueryStats(
            total_time_ms=execution_time,
            candidates_evaluated=result_set.candidates_evaluated,
            results_returned=len(result_set.node_ids),
            indexes_used=plan.indexes_used,
        ),
    )

function execute_plan(plan: ExecutionPlan, engine: NeuroSymbolicEngine) -> IntermediateResult:
    """
    Recursively execute plan steps.
""" results = None for step in plan.steps: match step: case VectorSearch: # HNSW search with optional filters results = execute_vector_search(step, engine.hnsw_index) case IndexScan: # Lookup in metadata index results = execute_index_scan(step, engine.metadata_indexes) case GraphTraversal: # Graph structure query results = execute_graph_traversal(step, engine.metadata_indexes.graph_index) case Intersect: # AND: set intersection left = execute_plan_step(step.left, engine) right = execute_plan_step(step.right, engine) results = intersect_results(left, right) case Union: # OR: set union left = execute_plan_step(step.left, engine) right = execute_plan_step(step.right, engine) results = union_results(left, right) case Difference: # NOT: set difference left = execute_plan_step(step.left, engine) right = execute_plan_step(step.right, engine) results = difference_results(left, right) case HybridScore: # Compute hybrid scores results = compute_hybrid_scores( step.neural_scores, step.symbolic_scores, step.alpha, step.beta ) case TopK: # Select top-k results input_results = execute_plan_step(step.input, engine) results = select_top_k(input_results, step.k, step.order_by) return results function execute_vector_search(step: VectorSearch, hnsw: HnswIndex) -> IntermediateResult: """ HNSW search with filter pushdown. Key optimization: Apply symbolic filters during HNSW traversal to avoid generating candidates that will be filtered out anyway. 
""" query_vector = step.query_vector similarity_threshold = step.similarity_threshold ef = step.ef inline_filters = step.filters # HNSW search with inline filtering candidates = [] visited = set() # Start from entry point current_node = hnsw.entry_point layer = hnsw.max_layer while layer >= 0: # Greedy search at this layer while True: neighbors = hnsw.get_neighbors(current_node, layer) best_neighbor = None best_distance = float('inf') for neighbor in neighbors: if neighbor in visited: continue # Apply inline filters BEFORE computing distance if not passes_inline_filters(neighbor, inline_filters, hnsw.metadata): continue # Skip this neighbor entirely # Compute distance only for filtered candidates distance = compute_distance(query_vector, hnsw.get_vector(neighbor)) similarity = distance_to_similarity(distance, step.metric) if similarity >= similarity_threshold: candidates.append((neighbor, similarity)) if distance < best_distance: best_distance = distance best_neighbor = neighbor visited.add(neighbor) if best_neighbor is None: break # No improvement current_node = best_neighbor layer -= 1 # Sort candidates by similarity candidates.sort(key=lambda x: x[1], reverse=True) return IntermediateResult( node_ids=[node_id for node_id, _ in candidates], neural_scores=[score for _, score in candidates], candidates_evaluated=len(visited) ) function passes_inline_filters(node_id: u32, filters: List[InlineFilter], metadata: MetadataStore) -> bool: """ Check if node passes all inline filters. This avoids computing distance for nodes that fail metadata constraints. """ for filter in filters: node_value = metadata.get(node_id, filter.field) if not evaluate_predicate(node_value, filter.operator, filter.value): return False # Failed a filter return True # Passed all filters function execute_index_scan(step: IndexScan, indexes: MetadataIndexes) -> IntermediateResult: """ Scan metadata index to get matching node IDs. 
""" index_name = step.index_name predicate = step.predicate match predicate: case Attribute(field, operator, value): if operator == ComparisonOp.Eq: # Exact match: use inverted index posting_list = indexes.inverted[field].lookup(value) return IntermediateResult( node_ids=posting_list.to_vec(), symbolic_scores=[1.0] * len(posting_list) # Binary: matches or not ) elif operator in [ComparisonOp.Lt, ComparisonOp.Le, ComparisonOp.Gt, ComparisonOp.Ge]: # Range query: use B-tree index matching_nodes = indexes.btree[field].range_query(operator, value) return IntermediateResult( node_ids=matching_nodes.to_vec(), symbolic_scores=[1.0] * len(matching_nodes) ) elif operator == ComparisonOp.In: # IN query: union of inverted index lookups all_nodes = RoaringBitmap() for v in value.list: posting_list = indexes.inverted[field].lookup(v) all_nodes |= posting_list # Union return IntermediateResult( node_ids=all_nodes.to_vec(), symbolic_scores=[1.0] * len(all_nodes) ) function execute_graph_traversal(step: GraphTraversal, graph_index: GraphStructureIndex) -> IntermediateResult: """ Execute graph structural constraint. 
""" match step.constraint: case InCommunity(community_id): # Lookup precomputed community membership node_ids = graph_index.communities.get(community_id) return IntermediateResult( node_ids=node_ids.to_vec(), symbolic_scores=[1.0] * len(node_ids) ) case WithinKHops(source_node, k): # Lookup precomputed k-hop neighborhood key = (source_node, k) if key in graph_index.khop_cache: node_ids = graph_index.khop_cache[key] else: # Compute on-the-fly via BFS node_ids = compute_khop_neighbors(source_node, k, graph_index.graph) return IntermediateResult( node_ids=node_ids.to_vec(), symbolic_scores=[1.0 / (1 + distance)] for distance in range(len(node_ids)) ) case OnPath(source, target): # Check path cache path_nodes = graph_index.path_cache.get_path(source, target) return IntermediateResult( node_ids=path_nodes, symbolic_scores=[1.0] * len(path_nodes) ) function intersect_results(left: IntermediateResult, right: IntermediateResult) -> IntermediateResult: """ Set intersection (AND): keep nodes in both sets. Use Roaring Bitmap for efficient intersection. """ left_bitmap = RoaringBitmap.from_sorted(left.node_ids) right_bitmap = RoaringBitmap.from_sorted(right.node_ids) intersection = left_bitmap & right_bitmap # Bitmap AND # Combine scores (average for simplicity) node_ids = intersection.to_vec() combined_scores = [] for node_id in node_ids: left_score = left.get_score(node_id) right_score = right.get_score(node_id) combined_scores.append((left_score + right_score) / 2.0) return IntermediateResult( node_ids=node_ids, scores=combined_scores ) function apply_hybrid_scoring(result_set, config: HybridScoringConfig) -> IntermediateResult: """ Combine neural and symbolic scores. 
    Formula: hybrid_score = α * normalize(neural) + β * normalize(symbolic)
    """
    neural_scores = result_set.neural_scores
    symbolic_scores = result_set.symbolic_scores

    # Normalize scores to [0, 1]
    if config.normalization == NormalizationMethod.MinMax:
        neural_norm = min_max_normalize(neural_scores)
        symbolic_norm = min_max_normalize(symbolic_scores)
    elif config.normalization == NormalizationMethod.ZScore:
        neural_norm = z_score_normalize(neural_scores)
        symbolic_norm = z_score_normalize(symbolic_scores)
    else:
        neural_norm = neural_scores
        symbolic_norm = symbolic_scores

    # Combine with weights
    alpha = config.neural_weight
    beta = config.symbolic_weight
    hybrid_scores = [alpha * n + beta * s for n, s in zip(neural_norm, symbolic_norm)]

    result_set.hybrid_scores = hybrid_scores
    return result_set
```

#### 2. Query Optimization
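As a concrete warm-up for the optimizer pseudocode, here is a toy illustration of selectivity-based predicate ordering. The cardinalities and field names are invented for illustration; the real optimizer would read them from index statistics:

```python
# Toy selectivity estimates under the uniform assumption: an equality
# predicate on a field with N distinct values matches ~1/N of the nodes.
# All numbers are illustrative.
cardinality = {"category": 50, "author": 5_000}  # distinct values per field
year_fraction = 0.10                             # fraction with year >= 2023

selectivity = {
    "category = 'research'": 1.0 / cardinality["category"],  # 0.02
    "author = 'smith'": 1.0 / cardinality["author"],         # 0.0002
    "year >= 2023": year_fraction,                           # 0.10
}

# Most selective predicate first: evaluating it first shrinks the
# candidate set fastest, so later predicates touch fewer nodes.
ordered = sorted(selectivity, key=selectivity.get)
print(ordered)  # author predicate first, year range last
```

The same ordering principle is what the `sorted(predicates, key=lambda p: selectivities[p])` line in the pseudocode below implements.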
```python
function optimize_query(query: Query, optimizer: QueryOptimizer) -> Query:
    """
    Optimize query execution plan.

    Key optimizations:
    1. Predicate pushdown (filters into HNSW search)
    2. Index selection (choose best index for each predicate)
    3. Join reordering (cheapest predicates first)
    4. Early termination (stop when enough candidates found)
    """
    # Extract predicates from WHERE clause
    predicates = extract_predicates(query.where_clause)

    # Classify predicates
    neural_preds = [p for p in predicates if is_neural_predicate(p)]
    symbolic_preds = [p for p in predicates if is_symbolic_predicate(p)]
    graph_preds = [p for p in predicates if is_graph_predicate(p)]

    # Estimate selectivity for each predicate
    selectivities = {}
    for pred in predicates:
        selectivities[pred] = estimate_selectivity(pred, optimizer.stats)

    # Predicate pushdown: which filters can be applied during HNSW search?
    inline_filters = []
    post_filters = []
    for pred in symbolic_preds:
        if can_pushdown(pred):
            inline_filters.append(pred)
        else:
            post_filters.append(pred)

    # Index selection: choose best index for each symbolic predicate
    index_plan = {}
    for pred in symbolic_preds:
        best_index = choose_best_index(pred, optimizer.indexes, selectivities[pred])
        index_plan[pred] = best_index

    # Reorder predicates: most selective first
    ordered_predicates = sorted(predicates, key=lambda p: selectivities[p])

    # Build optimized execution plan
    optimized_query = rewrite_query(
        query,
        inline_filters=inline_filters,
        post_filters=post_filters,
        index_plan=index_plan,
        predicate_order=ordered_predicates,
    )
    return optimized_query

function estimate_selectivity(predicate, stats) -> float:
    """
    Estimate fraction of nodes matching predicate.

    Uses index statistics (histograms, cardinality, etc.)
    """
    match predicate:
        case VectorSimilarity(threshold):
            # Estimate based on similarity distribution
            return estimate_similarity_selectivity(threshold, stats.similarity_histogram)
        case Attribute(field, operator, value):
            # Estimate based on attribute distribution
            if operator == ComparisonOp.Eq:
                return 1.0 / stats.cardinality[field]  # Uniform assumption
            elif operator in [Lt, Le, Gt, Ge]:
                return estimate_range_selectivity(field, operator, value, stats)
            elif operator == In:
                return len(value.list) / stats.cardinality[field]
        case Graph(constraint):
            # Estimate based on graph structure
            match constraint:
                case InCommunity(id):
                    return stats.community_sizes[id] / stats.total_nodes
                case WithinKHops(node, k):
                    return estimate_khop_size(node, k, stats) / stats.total_nodes

function can_pushdown(predicate) -> bool:
    """
    Check if predicate can be pushed into HNSW search.

    Only simple equality/range predicates on indexed fields can be pushed down.
""" match predicate: case Attribute(field, operator, value): # Can pushdown if operator is simple and field is indexed return operator in [Eq, Lt, Le, Gt, Ge, In] and is_indexed(field) case _: return False # Complex predicates handled post-search ``` ### API Design (Function Signatures) ```rust // File: crates/ruvector-query/src/neuro_symbolic/mod.rs impl NeuroSymbolicEngine { /// Create a new neuro-symbolic query engine pub fn new( hnsw_index: Arc, metadata_path: impl AsRef, ) -> Result; /// Execute a query (SQL or Cypher syntax) pub fn execute_query( &self, query: &str, ) -> Result; /// Execute a parsed query (AST) pub fn execute_parsed_query( &self, query: Query, ) -> Result; /// Add metadata index for a field pub fn create_index( &mut self, field: &str, index_type: IndexType, ) -> Result<(), QueryError>; /// Update hybrid scoring configuration pub fn set_scoring_config(&mut self, config: HybridScoringConfig); /// Get query execution statistics pub fn stats(&self) -> QueryEngineStats; } #[derive(Debug, Clone, Copy)] pub enum IndexType { Inverted, // Categorical fields BTree, // Range queries Bitmap, // Set operations } impl Query { /// Parse SQL query string into AST pub fn parse_sql(query: &str) -> Result; /// Parse Cypher query string into AST pub fn parse_cypher(query: &str) -> Result; /// Validate query syntax and semantics pub fn validate(&self) -> Result<(), ValidationError>; } impl Predicate { /// Evaluate predicate on a node pub fn evaluate( &self, node_id: u32, vector_store: &VectorStore, metadata_store: &MetadataStore, ) -> bool; /// Extract referenced fields pub fn referenced_fields(&self) -> Vec; /// Check if predicate is neural (vector similarity) pub fn is_neural(&self) -> bool; /// Check if predicate is symbolic (metadata) pub fn is_symbolic(&self) -> bool; /// Check if predicate is graph-structural pub fn is_graph_structural(&self) -> bool; } impl MetadataIndexes { /// Create indexes from metadata file pub fn from_metadata(path: impl AsRef) -> 
    Result<Self, IndexError>;

    /// Add inverted index for field
    pub fn add_inverted_index(
        &mut self,
        field: &str,
        values: HashMap<String, Vec<u32>>,
    ) -> Result<(), IndexError>;

    /// Add B-tree index for field
    pub fn add_btree_index(
        &mut self,
        field: &str,
        values: Vec<(OrderedValue, u32)>,
    ) -> Result<(), IndexError>;

    /// Query inverted index
    pub fn query_inverted(&self, field: &str, value: &str) -> Option<&RoaringBitmap>;

    /// Query B-tree index (range)
    pub fn query_btree_range(
        &self,
        field: &str,
        operator: ComparisonOp,
        value: OrderedValue,
    ) -> Option<RoaringBitmap>;

    /// Intersect bitmaps (AND operation)
    pub fn intersect(&self, bitmaps: &[RoaringBitmap]) -> RoaringBitmap;

    /// Union bitmaps (OR operation)
    pub fn union(&self, bitmaps: &[RoaringBitmap]) -> RoaringBitmap;

    /// Difference bitmaps (NOT operation)
    pub fn difference(&self, left: &RoaringBitmap, right: &RoaringBitmap) -> RoaringBitmap;
}

#[derive(Debug, Default)]
pub struct QueryEngineStats {
    pub total_queries: u64,
    pub avg_query_time_ms: f64,
    pub cache_hit_rate: f64,
    pub avg_candidates_evaluated: f64,
}
```

## Integration Points

### Affected Crates/Modules

1. **`ruvector-query`** (New Crate)
   - New module: `src/neuro_symbolic/mod.rs` - Core engine
   - New module: `src/neuro_symbolic/parser.rs` - SQL/Cypher parser
   - New module: `src/neuro_symbolic/optimizer.rs` - Query optimizer
   - New module: `src/neuro_symbolic/planner.rs` - Execution planner
   - New module: `src/neuro_symbolic/indexes.rs` - Metadata indexing
2. **`ruvector-core`** (Integration)
   - Modified: `src/index/hnsw.rs` - Add filter callback support
   - Modified: `src/vector_store.rs` - Expose metadata API
3. **`ruvector-api`** (Exposure)
   - Modified: `src/query.rs` - Add neuro-symbolic query endpoint
   - New: `src/query/sql.rs` - SQL query interface
   - New: `src/query/cypher.rs` - Cypher query interface
4.
**`ruvector-bindings`** (Language Bindings)
   - Modified: `python/src/lib.rs` - Expose query API
   - Modified: `nodejs/src/lib.rs` - Expose query API

### New Modules to Create

```
crates/ruvector-query/                 # New crate
├── src/
│   ├── neuro_symbolic/
│   │   ├── mod.rs                     # Core engine
│   │   ├── parser.rs                  # Query parsing
│   │   ├── optimizer.rs               # Query optimization
│   │   ├── planner.rs                 # Execution planning
│   │   ├── executor.rs                # Query execution
│   │   ├── indexes.rs                 # Metadata indexing
│   │   ├── scoring.rs                 # Hybrid scoring
│   │   └── stats.rs                   # Statistics collection
│   └── lib.rs
examples/
├── neuro_symbolic_queries/
│   ├── sql_examples.rs                # SQL query examples
│   ├── cypher_examples.rs             # Cypher query examples
│   ├── hybrid_scoring.rs              # Hybrid scoring examples
│   └── README.md
```

### Dependencies on Other Features

**Depends On:**
- **HNSW Index**: Core vector search functionality
- **Existing Cypher Support**: Extend existing graph query support

**Synergies With:**
- **GNN-Guided Routing (Feature 1)**: Can use GNN for smarter query execution
- **Incremental Learning (Feature 2)**: Real-time index updates support streaming queries

**External Dependencies:**
- `sqlparser` - SQL parsing
- `cypher-parser` - Cypher parsing (if not already present)
- `roaring` - Roaring Bitmap for efficient set operations
- `serde` - Query serialization

## Regression Prevention

### What Existing Functionality Could Break

1. **Pure Vector Search Performance**
   - Risk: Adding metadata lookups slows down simple vector queries
   - Impact: Regression in baseline HNSW performance
2. **Memory Usage**
   - Risk: Metadata indexes consume excessive RAM
   - Impact: OOM on large datasets
3. **Query Correctness**
   - Risk: Filter pushdown logic has bugs, returns wrong results
   - Impact: Incorrect search results
4.
**Cypher Compatibility**
   - Risk: Extending Cypher syntax breaks existing queries
   - Impact: Breaking change for existing users

### Test Cases to Prevent Regressions

```rust
// File: crates/ruvector-query/tests/neuro_symbolic_regression_tests.rs

#[test]
fn test_pure_vector_search_unchanged() {
    // Simple vector queries should have zero overhead
    let engine = setup_test_engine();

    // Baseline: pure HNSW search (no filters)
    let query_baseline =
        "SELECT * FROM vectors ORDER BY similarity(embedding, $query) DESC LIMIT 10";
    let start = Instant::now();
    let results = engine.execute_query(query_baseline).unwrap();
    let time_with_engine = start.elapsed();

    // Direct HNSW (without query engine)
    let start = Instant::now();
    let results_direct = engine.hnsw_index.search(&query_vector, 10).unwrap();
    let time_direct = start.elapsed();

    // Query engine should add <5% overhead
    let overhead = (time_with_engine.as_secs_f64() / time_direct.as_secs_f64()) - 1.0;
    assert!(overhead < 0.05, "Overhead: {:.2}%, expected <5%", overhead * 100.0);

    // Results should be identical
    assert_eq!(results.node_ids, results_direct.node_ids);
}

#[test]
fn test_filter_correctness() {
    // Filtered queries must return correct subset
    let engine = setup_test_engine_with_metadata();

    let query = "SELECT * FROM vectors \
                 WHERE similarity(embedding, $query) > 0.8 \
                   AND category = 'research' AND year >= 2023 \
                 LIMIT 10";
    let results = engine.execute_query(query).unwrap();

    // Verify each result matches ALL predicates
    for node_id in &results.node_ids {
        let similarity = compute_similarity(&query_vector, engine.get_vector(*node_id));
        assert!(similarity > 0.8, "Node {} similarity: {}, expected >0.8", node_id, similarity);

        let category = engine.get_metadata(*node_id, "category");
        assert_eq!(category, "research", "Node {} category: {}, expected 'research'", node_id, category);

        let year = engine.get_metadata(*node_id, "year").parse::<i32>().unwrap();
        assert!(year >= 2023, "Node {} year: {}, expected >=2023", node_id, year);
    }
}

#[test]
fn
test_filter_pushdown_performance() {
    // Pushdown filters should be much faster than post-filtering
    let engine = setup_test_engine_with_metadata();

    // With pushdown (optimized)
    let query_pushdown = "SELECT * FROM vectors WHERE similarity(embedding, $query) > 0.8 AND category = 'research' LIMIT 10";
    let start = Instant::now();
    let results_pushdown = engine.execute_query(query_pushdown).unwrap();
    let time_pushdown = start.elapsed();

    // Without pushdown (post-filter, manual implementation);
    // the over-retrieval is timed too, since it is part of the cost
    let start = Instant::now();
    let all_results = engine.hnsw_index.search(&query_vector, 10_000).unwrap();
    let results_post: Vec<_> = all_results.into_iter()
        .filter(|r| r.similarity > 0.8)
        .filter(|r| engine.get_metadata(r.node_id, "category") == "research")
        .take(10)
        .collect();
    let time_post = start.elapsed();

    // Pushdown should be ≥5x faster
    let speedup = time_post.as_secs_f64() / time_pushdown.as_secs_f64();
    assert!(speedup >= 5.0, "Speedup: {:.1}x, expected ≥5x", speedup);

    // Both strategies should return the same number of results
    assert_eq!(results_pushdown.node_ids.len(), results_post.len());
}

#[test]
fn test_hybrid_scoring_correctness() {
    // Hybrid scores should combine neural and symbolic components correctly
    let engine = setup_test_engine();
    engine.set_scoring_config(HybridScoringConfig {
        neural_weight: 0.7,
        symbolic_weight: 0.3,
        normalization: NormalizationMethod::MinMax,
    });

    let query = "SELECT * FROM vectors WHERE similarity(embedding, $query) > 0.5 AND year >= 2020 ORDER BY hybrid_score DESC LIMIT 10";
    let results = engine.execute_query(query).unwrap();

    // Verify hybrid score formula
    for i in 0..results.node_ids.len() {
        let neural = results.neural_scores[i];
        let symbolic = results.symbolic_scores.as_ref().unwrap()[i];

        // Normalize (min-max)
        let neural_norm = (neural - 0.5) / (1.0 - 0.5);     // Assuming min=0.5, max=1.0
        let symbolic_norm = (symbolic - 0.0) / (1.0 - 0.0); // Assuming min=0.0, max=1.0

        let expected_hybrid = 0.7 * neural_norm + 0.3 * symbolic_norm;
        let actual_hybrid = results.hybrid_scores[i];
        assert!((expected_hybrid - actual_hybrid).abs() < 1e-5,
            "Hybrid score mismatch: expected {}, got {}", expected_hybrid, actual_hybrid);
    }
}

#[test]
fn test_boolean_logic_correctness() {
    // AND/OR/NOT operations must be correct
    let engine = setup_test_engine();

    // Test AND
    let query_and = "SELECT * FROM vectors WHERE category = 'A' AND tag = 'X'";
    let results_and = engine.execute_query(query_and).unwrap();
    for node_id in &results_and.node_ids {
        assert_eq!(engine.get_metadata(*node_id, "category"), "A");
        assert_eq!(engine.get_metadata(*node_id, "tag"), "X");
    }

    // Test OR
    let query_or = "SELECT * FROM vectors WHERE category = 'A' OR category = 'B'";
    let results_or = engine.execute_query(query_or).unwrap();
    for node_id in &results_or.node_ids {
        let category = engine.get_metadata(*node_id, "category");
        assert!(category == "A" || category == "B");
    }

    // Test NOT
    let query_not = "SELECT * FROM vectors WHERE category = 'A' AND NOT tag = 'X'";
    let results_not = engine.execute_query(query_not).unwrap();
    for node_id in &results_not.node_ids {
        assert_eq!(engine.get_metadata(*node_id, "category"), "A");
        assert_ne!(engine.get_metadata(*node_id, "tag"), "X");
    }
}
```

### Backward Compatibility Strategy

1. **Opt-In Feature**
   - Neuro-symbolic queries are opt-in (require explicit SQL/Cypher syntax)
   - Existing vector search API unchanged
2. **Graceful Degradation**
   - If metadata indexes are not available, fall back to post-filtering
   - Log a warning but do not crash
3. **Configuration**

   ```yaml
   query:
     neuro_symbolic:
       enabled: true          # Default: true
       metadata_indexes: true # Default: true
       hybrid_scoring: true   # Default: true
   ```

4. **API Versioning**
   - New endpoints for neuro-symbolic queries (`/query/sql`, `/query/cypher`)
   - Existing endpoints (`/search`) unchanged

## Implementation Phases

### Phase 1: Core Infrastructure (Week 1-2)

**Goal**: Query parsing and basic execution

**Tasks**:
1. Implement SQL/Cypher parser
2. Build AST representation
3.
Implement basic query executor (no optimization)
4. Unit tests for parsing and execution

**Deliverables**:
- `neuro_symbolic/parser.rs`
- `neuro_symbolic/executor.rs`
- Passing unit tests

**Success Criteria**:
- Can parse and execute simple queries (vector similarity only)
- Correct results (matches HNSW baseline)

### Phase 2: Metadata Indexing (Week 2-3)

**Goal**: Support symbolic predicates

**Tasks**:
1. Implement inverted index for categorical fields
2. Implement B-tree index for range queries
3. Integrate Roaring Bitmap for set operations
4. Test index correctness and performance

**Deliverables**:
- `neuro_symbolic/indexes.rs`
- Index creation and query APIs
- Benchmark report

**Success Criteria**:
- Indexes correctly return matching nodes
- Index queries <10ms for typical workloads
- Memory overhead <20% of vector data

### Phase 3: Filter Pushdown (Week 3-4)

**Goal**: Optimize query execution

**Tasks**:
1. Implement filter pushdown into HNSW search
2. Modify HNSW to support filter callbacks
3. Benchmark speedup vs post-filtering
4. Test correctness of pushdown logic

**Deliverables**:
- Modified `hnsw.rs` with filter support
- `neuro_symbolic/optimizer.rs`
- Performance benchmarks

**Success Criteria**:
- ≥5x speedup for filtered queries
- Zero correctness regressions
- Works with complex boolean logic (AND/OR/NOT)

### Phase 4: Hybrid Scoring (Week 4-5)

**Goal**: Combine neural and symbolic scores

**Tasks**:
1. Implement hybrid scoring algorithm
2. Add score normalization methods
3. Tune weights (α, β) for best results
4. Test on real-world datasets

**Deliverables**:
- `neuro_symbolic/scoring.rs`
- Hybrid scoring benchmarks
- Configuration guide

**Success Criteria**:
- Hybrid queries improve relevance metrics (NDCG, MRR)
- Configurable weights work as expected
- Performance <20ms for typical queries

### Phase 5: Production Hardening (Week 5-6)

**Goal**: Production-ready feature

**Tasks**:
1. Add comprehensive error handling
2. Write documentation and examples
3.
Stress testing (large datasets, complex queries)
4. Integration with existing Cypher support

**Deliverables**:
- Full error handling
- User documentation
- Example queries
- Regression test suite

**Success Criteria**:
- Zero crashes in stress tests
- Documentation complete
- Ready for alpha release

## Success Metrics

### Performance Benchmarks

**Primary Metrics** (Must Achieve):

| Query Type | Baseline (Post-Filter) | Neuro-Symbolic | Target Improvement |
|------------|------------------------|----------------|--------------------|
| Similarity + 1 filter | 50ms | 5ms | **10x faster** |
| Similarity + 3 filters | 200ms | 8ms | **25x faster** |
| Complex boolean (AND/OR/NOT) | N/A (manual) | 15ms | **New capability** |
| Multi-modal (vector + graph) | 500ms (manual joins) | 20ms | **25x faster** |

**Secondary Metrics**:

| Metric | Target |
|--------|--------|
| Index memory overhead | <20% of vector data |
| Query parsing time | <1ms |
| Hybrid scoring overhead | <2ms |
| Concurrent query throughput | Same as baseline |

### Accuracy Metrics

**Relevance Improvement** (on benchmark datasets):
- NDCG@10: +15% (hybrid scoring vs pure vector)
- MRR (Mean Reciprocal Rank): +20%
- Precision@10: +10%

**Correctness**:
- 100% of filtered results match all predicates
- Zero false positives or false negatives

### Memory/Latency Targets

**Memory**:
- Inverted indexes: <100MB per 1M nodes (categorical fields)
- B-tree indexes: <50MB per 1M nodes (range fields)
- Total overhead: <20% of vector index size

**Latency**:
- Simple query (1 filter): <10ms
- Complex query (3+ filters): <20ms
- Hybrid scoring: <5ms overhead
- P99 latency: <50ms

**Throughput**:
- Concurrent queries: Same as baseline HNSW
- No lock contention on indexes

## Risks and Mitigations

### Technical Risks

**Risk 1: Query Parser Complexity**

*Probability: Medium | Impact: Medium*

**Description**: SQL/Cypher parsing is complex and could have bugs or performance issues.
**Mitigation**:
- Use established parsing libraries (`sqlparser`, `cypher-parser`)
- Extensive test suite with edge cases
- Validate AST before execution
- Provide query validation tool

**Contingency**: Start with a simple query subset, expand incrementally.

---

**Risk 2: Index Memory Overhead**

*Probability: High | Impact: Medium*

**Description**: Metadata indexes could consume excessive memory on large datasets.

**Mitigation**:
- Use compressed indexes (Roaring Bitmap for sparse sets)
- Make indexing optional (user chooses which fields to index)
- Monitor memory usage in tests
- Provide index size estimation tool

**Contingency**: Support external indexes (e.g., SQLite) for low-memory environments.

---

**Risk 3: Filter Pushdown Bugs**

*Probability: Medium | Impact: Critical*

**Description**: Incorrect filter logic could return wrong results.

**Mitigation**:
- Extensive correctness testing (ground truth validation)
- Compare pushdown results vs post-filtering
- Add assertion checks in debug builds
- Fuzzing for edge cases

**Contingency**: Add a "safe mode" that validates results against post-filtering.

---

**Risk 4: Hybrid Scoring Tuning Difficulty**

*Probability: High | Impact: Low*

**Description**: Users may struggle to tune α/β weights for hybrid scoring.

**Mitigation**:
- Provide automatic weight tuning (based on query logs)
- Document recommended defaults for common use cases
- Add visualization tools for score distributions
- Support A/B testing framework

**Contingency**: Default to pure neural scoring (α=1, β=0) if the user is unsure.

---

**Risk 5: Cypher Integration Conflicts**

*Probability: Low | Impact: Medium*

**Description**: Extending Cypher syntax could conflict with existing graph queries.

**Mitigation**:
- Careful syntax design (use reserved keywords)
- Version Cypher extensions separately
- Extensive compatibility testing
- Document syntax differences

**Contingency**: Use a separate query language (e.g., extended SQL only) if conflicts arise.
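To make the α/β weighting from Risk 4 concrete, here is a minimal sketch of the hybrid scoring planned for Phase 4 (min-max normalization, then `α * neural + β * symbolic`). Function names are illustrative, not the crate's final API:

```rust
/// Min-max normalize scores into [0, 1]; if all scores are equal,
/// return 0.5 for each to avoid division by zero.
fn min_max_normalize(scores: &[f64]) -> Vec<f64> {
    let min = scores.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    if (max - min).abs() < f64::EPSILON {
        return vec![0.5; scores.len()];
    }
    scores.iter().map(|s| (s - min) / (max - min)).collect()
}

/// Combine per-candidate neural and symbolic scores with weights alpha/beta.
fn hybrid_scores(neural: &[f64], symbolic: &[f64], alpha: f64, beta: f64) -> Vec<f64> {
    let n = min_max_normalize(neural);
    let s = min_max_normalize(symbolic);
    n.iter().zip(s.iter()).map(|(n, s)| alpha * n + beta * s).collect()
}

fn main() {
    let neural = [0.9, 0.6, 0.8];   // cosine similarities
    let symbolic = [1.0, 1.0, 0.0]; // fraction of predicates satisfied
    let scores = hybrid_scores(&neural, &symbolic, 0.7, 0.3);
    // Candidate 0 (highest similarity AND all predicates satisfied) ranks first
    println!("{:?}", scores);
}
```

The Risk 4 contingency corresponds to calling this with `alpha = 1.0, beta = 0.0`, which reduces the formula to pure neural scoring.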
---

### Summary Risk Matrix

| Risk | Probability | Impact | Mitigation Priority |
|------|-------------|--------|---------------------|
| Query parser complexity | Medium | Medium | Medium |
| Index memory overhead | High | Medium | **HIGH** |
| Filter pushdown bugs | Medium | Critical | **CRITICAL** |
| Hybrid scoring tuning | High | Low | LOW |
| Cypher integration conflicts | Low | Medium | Medium |

---

## Next Steps

1. **Prototype Phase 1**: Build SQL parser and basic executor (1 week)
2. **Validate Queries**: Test on simple queries, measure correctness (2 days)
3. **Add Metadata Indexes**: Implement inverted + B-tree indexes (1 week)
4. **Benchmark Performance**: Measure speedup vs post-filtering (3 days)
5. **Iterate**: Optimize based on profiling (ongoing)

**Key Decision Points**:
- After Phase 1: Is query parsing fast enough? (<1ms target)
- After Phase 3: Does filter pushdown work correctly? (Zero regressions)
- After Phase 4: Does hybrid scoring improve relevance? (+10% NDCG required)

**Go/No-Go Criteria**:
- ✅ 5x+ speedup on filtered queries
- ✅ Zero correctness regressions
- ✅ Memory overhead <20%
- ✅ Improved relevance metrics
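As a starting point for step 3 of the Next Steps ("Add Metadata Indexes"), the inverted index can be sketched with std collections alone; in the real implementation the plan's `roaring` dependency would replace `BTreeSet<u64>` with compressed bitmaps. All names here are illustrative:

```rust
use std::collections::{BTreeSet, HashMap};

/// Inverted index: (field, value) -> set of node ids.
/// Production code would use roaring bitmaps instead of BTreeSet.
#[derive(Default)]
struct InvertedIndex {
    postings: HashMap<(String, String), BTreeSet<u64>>,
}

impl InvertedIndex {
    fn insert(&mut self, field: &str, value: &str, node_id: u64) {
        self.postings
            .entry((field.to_string(), value.to_string()))
            .or_default()
            .insert(node_id);
    }

    /// AND of two equality predicates = intersection of posting lists.
    fn and(&self, a: (&str, &str), b: (&str, &str)) -> BTreeSet<u64> {
        let empty = BTreeSet::new();
        let pa = self.postings.get(&(a.0.to_string(), a.1.to_string())).unwrap_or(&empty);
        let pb = self.postings.get(&(b.0.to_string(), b.1.to_string())).unwrap_or(&empty);
        pa.intersection(pb).cloned().collect()
    }
}

fn main() {
    let mut idx = InvertedIndex::default();
    idx.insert("category", "research", 1);
    idx.insert("category", "research", 2);
    idx.insert("tag", "X", 2);
    idx.insert("tag", "X", 3);
    // "category = 'research' AND tag = 'X'" -> only node 2 qualifies
    let hits = idx.and(("category", "research"), ("tag", "X"));
    println!("{:?}", hits); // prints {2}
}
```

Filter pushdown would then pass the resulting id set into the HNSW search as an allow-list, so candidates failing the predicates are skipped during traversal rather than discarded afterwards.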