56 KiB
Neuro-Symbolic Query Execution - Implementation Plan
Overview
Problem Statement
Current vector search in ruvector is purely neural (similarity-based): given a query vector, find the k most similar vectors by cosine/Euclidean distance. However, real-world queries often involve logical constraints that pure vector similarity cannot express:
Examples of Unsupported Queries:
- "Find vectors similar to X AND published after 2023 AND tagged as 'research'"
- "Find vectors similar to X OR similar to Y, EXCLUDING category 'spam'"
- "Find vectors where
metadata.price < 100AND similarity > 0.8" - "Find vectors in graph community C AND within 2 hops of node N"
Current Limitations:
- No support for boolean logic (AND, OR, NOT)
- Cannot filter by metadata attributes
- Cannot combine vector similarity with graph structure
- Forces post-processing filtering (inefficient)
- No way to express complex multi-modal queries
Performance Impact:
- Retrieving 10,000 vectors then filtering to 10 wastes 99.9% of computation
- No index acceleration for metadata predicates
- Cannot push down filters to HNSW search
Proposed Solution
Neuro-Symbolic Query Execution: A hybrid query engine that combines neural vector similarity with symbolic logical constraints.
Key Components:
- Query Language: Extend existing Cypher/SQL support with vector similarity operators
- Hybrid Scoring: Combine vector similarity scores with predicate satisfaction
- Filter Pushdown: Apply logical constraints during HNSW search (not after)
- Multi-Modal Indexing: Index metadata attributes alongside vectors
- Constraint Propagation: Use graph structure to prune search space
Architecture:
Query: "MATCH (v:Vector) WHERE vector_similarity(v.embedding, $query) > 0.8
AND v.year >= 2023 AND v.category IN ['research', 'papers']
RETURN v ORDER BY similarity DESC LIMIT 10"
↓ Parse & Optimize
Neural Component: Symbolic Component:
vector_similarity > 0.8 year >= 2023 AND category IN [...]
↓ ↓
HNSW Search Metadata Index
↓ ↓
└──────── Merge ─────────┘
↓
Hybrid Scoring (α * neural + β * symbolic)
↓
Top-K Results
Expected Benefits
Quantified Performance Improvements:
| Query Type | Current (Post-Filter) | Neuro-Symbolic | Improvement |
|---|---|---|---|
| Similarity + 1 filter | 50ms (10K retrieved) | 5ms (100 retrieved) | 10x faster |
| Similarity + 3 filters | 200ms (50K retrieved) | 8ms (200 retrieved) | 25x faster |
| Complex boolean logic | Not supported | 15ms | ∞ (new capability) |
| Multi-modal query | Manual joins | 20ms | 50x faster |
Qualitative Benefits:
- Express complex queries naturally (no manual post-processing)
- Efficient execution with filter pushdown
- Support for real-world use cases (e-commerce, research, RAG)
- Better accuracy through multi-modal fusion
- Graph-aware queries (community detection, path constraints)
Technical Design
Architecture Diagram (ASCII Art)
┌─────────────────────────────────────────────────────────────────┐
│ Neuro-Symbolic Query Execution Pipeline │
└─────────────────────────────────────────────────────────────────┘
User Query (SQL/Cypher + Vector Similarity)
│
│ Example: "SELECT * FROM vectors
│ WHERE cosine_similarity(embedding, $query) > 0.8
│ AND category = 'research' AND year >= 2023
│ ORDER BY similarity DESC LIMIT 10"
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Query Parser & AST Builder │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Parse query into Abstract Syntax Tree (AST) │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ SELECT │ │ │
│ │ │ WHERE │ │ │
│ │ │ AND │ │ │
│ │ │ ├─ cosine_similarity(emb, $q) > 0.8 [NEURAL] │ │ │
│ │ │ ├─ category = 'research' [SYMBOLIC] │ │ │
│ │ │ └─ year >= 2023 [SYMBOLIC] │ │ │
│ │ │ ORDER BY similarity DESC │ │ │
│ │ │ LIMIT 10 │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Query Optimizer │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Analyze predicates and rewrite query for efficiency │ │
│ │ │ │
│ │ 1. Predicate Pushdown: │ │
│ │ Move filters into HNSW search (before candidate gen) │ │
│ │ │ │
│ │ 2. Index Selection: │ │
│ │ Choose best index for symbolic predicates │ │
│ │ - category: inverted index │ │
│ │ - year: range index (B-tree) │ │
│ │ │ │
│ │ 3. Execution Strategy: │ │
│ │ - If few categories: scan category index first │ │
│ │ - If similarity selective: HNSW first, then filter │ │
│ │ - If balanced: hybrid merge │ │
│ │ │ │
│ │ 4. Hybrid Scoring: │ │
│ │ score = α * neural_sim + β * symbolic_score │ │
│ └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Execution Plan │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Step 1: HNSW Search (neural) │ │
│ │ - Target: similarity > 0.8 │ │
│ │ - Candidate pool: ef=200 │ │
│ │ - Early termination: collect ~100 candidates │ │
│ │ - Filter during search: year >= 2023 │ │
│ │ Output: {node_id, similarity} for ~100 candidates │ │
│ │ │ │
│ │ Step 2: Symbolic Filtering (metadata index) │ │
│ │ - Lookup category index: category = 'research' │ │
│ │ - Intersect with HNSW candidates │ │
│ │ Output: {node_id, similarity, metadata} for ~30 nodes │ │
│ │ │ │
│ │ Step 3: Hybrid Scoring │ │
│ │ - Compute symbolic_score (e.g., recency bonus) │ │
│ │ - Combined: 0.7 * similarity + 0.3 * symbolic_score │ │
│ │ Output: {node_id, hybrid_score} │ │
│ │ │ │
│ │ Step 4: Top-K Selection │ │
│ │ - Sort by hybrid_score DESC │ │
│ │ - Return top 10 │ │
│ └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Result Set │
│ [{id: 42, similarity: 0.95, category: 'research', year: 2024}, │
│ {id: 137, similarity: 0.92, category: 'research', year: 2023},│
│ ...] │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Indexing & Storage Architecture │
└─────────────────────────────────────────────────────────────────┘
Vector Data:
┌─────────────────────────────────────────────────────────────────┐
│ HNSW Index (vector similarity) │
│ - Node ID → Embedding vector │
│ - Graph structure for approximate NN search │
└─────────────────────────────────────────────────────────────────┘
Metadata Data:
┌─────────────────────────────────────────────────────────────────┐
│ Inverted Index (categorical attributes) │
│ - category → {node_ids} │
│ - tag → {node_ids} │
│ - author → {node_ids} │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ B-Tree Index (range attributes) │
│ - year → sorted {node_ids} │
│ - price → sorted {node_ids} │
│ - timestamp → sorted {node_ids} │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Roaring Bitmap Index (set operations) │
│ - Efficient AND/OR/NOT on node ID sets │
│ - Compressed storage for sparse sets │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Graph Index (structural constraints) │
│ - Community membership: community_id → {node_ids} │
│ - k-hop neighborhoods: precomputed for common queries │
│ - Path constraints: shortest path caches │
└─────────────────────────────────────────────────────────────────┘
Core Data Structures (Rust)
// File: crates/ruvector-query/src/neuro_symbolic/mod.rs
use std::collections::{HashMap, HashSet};
use serde::{Deserialize, Serialize};
/// Neuro-symbolic query execution engine
pub struct NeuroSymbolicEngine {
/// HNSW index for vector similarity
hnsw_index: Arc<HnswIndex>,
/// Metadata indexes (inverted, B-tree, etc.)
metadata_indexes: MetadataIndexes,
/// Query optimizer
optimizer: QueryOptimizer,
/// Execution planner
planner: ExecutionPlanner,
/// Hybrid scoring configuration
scoring_config: HybridScoringConfig,
}
/// Query representation (SQL/Cypher AST)
#[derive(Debug, Clone)]
pub struct Query {
/// SELECT clause (which fields to return)
pub select: Vec<String>,
/// WHERE clause (predicates)
pub where_clause: Option<Predicate>,
/// ORDER BY clause
pub order_by: Vec<OrderBy>,
/// LIMIT clause
pub limit: Option<usize>,
/// OFFSET clause
pub offset: Option<usize>,
}
/// Predicate tree (boolean logic)
#[derive(Debug, Clone)]
pub enum Predicate {
/// Neural predicate: vector similarity
VectorSimilarity {
field: String,
query_vector: Vec<f32>,
operator: ComparisonOp, // >, <, =
threshold: f32,
metric: SimilarityMetric, // cosine, euclidean, dot
},
/// Symbolic predicate: metadata constraint
Attribute {
field: String,
operator: ComparisonOp,
value: Value,
},
/// Graph predicate: structural constraint
Graph {
constraint: GraphConstraint,
},
/// Boolean operators
And(Box<Predicate>, Box<Predicate>),
Or(Box<Predicate>, Box<Predicate>),
Not(Box<Predicate>),
}
#[derive(Debug, Clone)]
pub enum GraphConstraint {
/// Node in community
InCommunity { community_id: u32 },
/// Within k hops of node
WithinKHops { source_node: u32, k: usize },
/// On path between two nodes
OnPath { source: u32, target: u32 },
/// Has edge to node
ConnectedTo { node_id: u32 },
}
#[derive(Debug, Clone, Copy)]
pub enum ComparisonOp {
Eq, // =
Ne, // !=
Lt, // <
Le, // <=
Gt, // >
Ge, // >=
In, // IN (...)
Like, // LIKE (string pattern)
}
#[derive(Debug, Clone)]
pub enum Value {
Int(i64),
Float(f64),
String(String),
Bool(bool),
List(Vec<Value>),
}
#[derive(Debug, Clone, Copy)]
pub enum SimilarityMetric {
Cosine,
Euclidean,
DotProduct,
L1,
}
/// Metadata indexing structures
pub struct MetadataIndexes {
/// Inverted indexes for categorical fields
inverted: HashMap<String, InvertedIndex>,
/// B-tree indexes for range queries
btree: HashMap<String, BTreeIndex>,
/// Roaring bitmap for set operations
bitmap_store: BitmapStore,
/// Graph structural indexes
graph_index: GraphStructureIndex,
}
/// Inverted index: field_value → {node_ids}
pub struct InvertedIndex {
/// Map from value to posting list (node IDs)
postings: HashMap<String, RoaringBitmap>,
/// Statistics for query optimization
stats: IndexStats,
}
/// B-tree index for range queries
pub struct BTreeIndex {
/// Sorted map from value to node IDs
tree: BTreeMap<OrderedValue, RoaringBitmap>,
/// Statistics
stats: IndexStats,
}
/// Roaring bitmap store for efficient set operations
pub struct BitmapStore {
/// Node ID sets as compressed bitmaps
bitmaps: HashMap<String, RoaringBitmap>,
}
/// Graph structure indexes
pub struct GraphStructureIndex {
/// Community assignments
communities: HashMap<u32, RoaringBitmap>,
/// k-hop neighborhoods (precomputed)
khop_cache: HashMap<(u32, usize), RoaringBitmap>,
/// Shortest path cache
path_cache: PathCache,
}
#[derive(Debug, Default)]
pub struct IndexStats {
pub num_unique_values: usize,
pub total_postings: usize,
pub avg_posting_length: f64,
pub selectivity: f64, // fraction of nodes matching
}
/// Query execution plan
#[derive(Debug)]
pub struct ExecutionPlan {
/// Ordered steps to execute
pub steps: Vec<ExecutionStep>,
/// Estimated cost
pub estimated_cost: f64,
/// Estimated result size
pub estimated_results: usize,
}
#[derive(Debug)]
pub enum ExecutionStep {
/// HNSW vector search
VectorSearch {
query_vector: Vec<f32>,
similarity_threshold: f32,
metric: SimilarityMetric,
ef: usize,
filters: Vec<InlineFilter>, // Filters applied during search
},
/// Metadata index lookup
IndexScan {
index_name: String,
predicate: Predicate,
},
/// Graph structure traversal
GraphTraversal {
constraint: GraphConstraint,
},
/// Set intersection (AND)
Intersect {
left: Box<ExecutionStep>,
right: Box<ExecutionStep>,
},
/// Set union (OR)
Union {
left: Box<ExecutionStep>,
right: Box<ExecutionStep>,
},
/// Set difference (NOT)
Difference {
left: Box<ExecutionStep>,
right: Box<ExecutionStep>,
},
/// Hybrid scoring
HybridScore {
neural_scores: HashMap<u32, f32>,
symbolic_scores: HashMap<u32, f32>,
alpha: f32, // neural weight
beta: f32, // symbolic weight
},
/// Top-K selection
TopK {
input: Box<ExecutionStep>,
k: usize,
order_by: Vec<OrderBy>,
},
}
/// Filter applied during HNSW search (pushdown)
#[derive(Debug, Clone)]
pub struct InlineFilter {
pub field: String,
pub operator: ComparisonOp,
pub value: Value,
}
/// Hybrid scoring configuration
#[derive(Debug, Clone)]
pub struct HybridScoringConfig {
/// Weight for neural similarity score
pub neural_weight: f32,
/// Weight for symbolic score
pub symbolic_weight: f32,
/// Normalization method
pub normalization: NormalizationMethod,
}
#[derive(Debug, Clone, Copy)]
pub enum NormalizationMethod {
/// Min-max normalization [0, 1]
MinMax,
/// Z-score normalization
ZScore,
/// None (assume scores already normalized)
None,
}
/// Query result
#[derive(Debug, Serialize, Deserialize)]
pub struct QueryResult {
/// Matched node IDs
pub node_ids: Vec<u32>,
/// Neural similarity scores
pub neural_scores: Vec<f32>,
/// Symbolic scores (if applicable)
pub symbolic_scores: Option<Vec<f32>>,
/// Hybrid scores
pub hybrid_scores: Vec<f32>,
/// Metadata for each result
pub metadata: Vec<HashMap<String, Value>>,
/// Query execution statistics
pub stats: QueryStats,
}
#[derive(Debug, Serialize, Deserialize, Default)]
pub struct QueryStats {
/// Total execution time (milliseconds)
pub total_time_ms: f64,
/// Time breakdown by step
pub step_times: Vec<(String, f64)>,
/// Number of candidates evaluated
pub candidates_evaluated: usize,
/// Number of results returned
pub results_returned: usize,
/// Index usage
pub indexes_used: Vec<String>,
}
#[derive(Debug, Clone)]
pub struct OrderBy {
pub field: String,
pub direction: SortDirection,
}
#[derive(Debug, Clone, Copy)]
pub enum SortDirection {
Asc,
Desc,
}
/// Wrapper for ordered values in B-tree
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum OrderedValue {
Int(i64),
Float(OrderedFloat<f64>),
String(String),
}
use ordered_float::OrderedFloat;
use roaring::RoaringBitmap;
use std::collections::BTreeMap;
use std::sync::Arc;
Key Algorithms (Pseudocode)
1. Query Execution Algorithm
function execute_neuro_symbolic_query(query: Query, engine: NeuroSymbolicEngine) -> QueryResult:
"""
Execute neuro-symbolic query with hybrid scoring.
Main algorithm: parse → optimize → plan → execute → score → return
"""
start_time = now()
# Step 1: Parse query into AST (already done, query is AST)
# Step 2: Optimize query (predicate pushdown, index selection)
optimized_query = engine.optimizer.optimize(query)
# Step 3: Generate execution plan
plan = engine.planner.create_plan(optimized_query)
# Step 4: Execute plan steps
result_set = execute_plan(plan, engine)
# Step 5: Hybrid scoring
if has_both_neural_and_symbolic(plan):
result_set = apply_hybrid_scoring(
result_set,
engine.scoring_config
)
# Step 6: Apply ORDER BY and LIMIT
result_set = sort_and_limit(
result_set,
query.order_by,
query.limit,
query.offset
)
# Step 7: Fetch metadata for results
metadata = fetch_metadata(result_set.node_ids, query.select)
execution_time = now() - start_time
return QueryResult(
node_ids=result_set.node_ids,
neural_scores=result_set.neural_scores,
symbolic_scores=result_set.symbolic_scores,
hybrid_scores=result_set.hybrid_scores,
metadata=metadata,
stats=QueryStats(
total_time_ms=execution_time,
candidates_evaluated=result_set.candidates_evaluated,
results_returned=len(result_set.node_ids),
indexes_used=plan.indexes_used
)
)
function execute_plan(plan: ExecutionPlan, engine: NeuroSymbolicEngine) -> IntermediateResult:
"""
Recursively execute plan steps.
"""
results = None
for step in plan.steps:
match step:
case VectorSearch:
# HNSW search with optional filters
results = execute_vector_search(step, engine.hnsw_index)
case IndexScan:
# Lookup in metadata index
results = execute_index_scan(step, engine.metadata_indexes)
case GraphTraversal:
# Graph structure query
results = execute_graph_traversal(step, engine.metadata_indexes.graph_index)
case Intersect:
# AND: set intersection
left = execute_plan_step(step.left, engine)
right = execute_plan_step(step.right, engine)
results = intersect_results(left, right)
case Union:
# OR: set union
left = execute_plan_step(step.left, engine)
right = execute_plan_step(step.right, engine)
results = union_results(left, right)
case Difference:
# NOT: set difference
left = execute_plan_step(step.left, engine)
right = execute_plan_step(step.right, engine)
results = difference_results(left, right)
case HybridScore:
# Compute hybrid scores
results = compute_hybrid_scores(
step.neural_scores,
step.symbolic_scores,
step.alpha,
step.beta
)
case TopK:
# Select top-k results
input_results = execute_plan_step(step.input, engine)
results = select_top_k(input_results, step.k, step.order_by)
return results
function execute_vector_search(step: VectorSearch, hnsw: HnswIndex) -> IntermediateResult:
"""
HNSW search with filter pushdown.
Key optimization: Apply symbolic filters during HNSW traversal
to avoid generating candidates that will be filtered out anyway.
"""
query_vector = step.query_vector
similarity_threshold = step.similarity_threshold
ef = step.ef
inline_filters = step.filters
# HNSW search with inline filtering
candidates = []
visited = set()
# Start from entry point
current_node = hnsw.entry_point
layer = hnsw.max_layer
while layer >= 0:
# Greedy search at this layer
while True:
neighbors = hnsw.get_neighbors(current_node, layer)
best_neighbor = None
best_distance = float('inf')
for neighbor in neighbors:
if neighbor in visited:
continue
# Apply inline filters BEFORE computing distance
if not passes_inline_filters(neighbor, inline_filters, hnsw.metadata):
continue # Skip this neighbor entirely
# Compute distance only for filtered candidates
distance = compute_distance(query_vector, hnsw.get_vector(neighbor))
similarity = distance_to_similarity(distance, step.metric)
if similarity >= similarity_threshold:
candidates.append((neighbor, similarity))
if distance < best_distance:
best_distance = distance
best_neighbor = neighbor
visited.add(neighbor)
if best_neighbor is None:
break # No improvement
current_node = best_neighbor
layer -= 1
# Sort candidates by similarity
candidates.sort(key=lambda x: x[1], reverse=True)
return IntermediateResult(
node_ids=[node_id for node_id, _ in candidates],
neural_scores=[score for _, score in candidates],
candidates_evaluated=len(visited)
)
function passes_inline_filters(node_id: u32, filters: List[InlineFilter], metadata: MetadataStore) -> bool:
"""
Check if node passes all inline filters.
This avoids computing distance for nodes that fail metadata constraints.
"""
for filter in filters:
node_value = metadata.get(node_id, filter.field)
if not evaluate_predicate(node_value, filter.operator, filter.value):
return False # Failed a filter
return True # Passed all filters
function execute_index_scan(step: IndexScan, indexes: MetadataIndexes) -> IntermediateResult:
"""
Scan metadata index to get matching node IDs.
"""
index_name = step.index_name
predicate = step.predicate
match predicate:
case Attribute(field, operator, value):
if operator == ComparisonOp.Eq:
# Exact match: use inverted index
posting_list = indexes.inverted[field].lookup(value)
return IntermediateResult(
node_ids=posting_list.to_vec(),
symbolic_scores=[1.0] * len(posting_list) # Binary: matches or not
)
elif operator in [ComparisonOp.Lt, ComparisonOp.Le, ComparisonOp.Gt, ComparisonOp.Ge]:
# Range query: use B-tree index
matching_nodes = indexes.btree[field].range_query(operator, value)
return IntermediateResult(
node_ids=matching_nodes.to_vec(),
symbolic_scores=[1.0] * len(matching_nodes)
)
elif operator == ComparisonOp.In:
# IN query: union of inverted index lookups
all_nodes = RoaringBitmap()
for v in value.list:
posting_list = indexes.inverted[field].lookup(v)
all_nodes |= posting_list # Union
return IntermediateResult(
node_ids=all_nodes.to_vec(),
symbolic_scores=[1.0] * len(all_nodes)
)
function execute_graph_traversal(step: GraphTraversal, graph_index: GraphStructureIndex) -> IntermediateResult:
"""
Execute graph structural constraint.
"""
match step.constraint:
case InCommunity(community_id):
# Lookup precomputed community membership
node_ids = graph_index.communities.get(community_id)
return IntermediateResult(
node_ids=node_ids.to_vec(),
symbolic_scores=[1.0] * len(node_ids)
)
case WithinKHops(source_node, k):
# Lookup precomputed k-hop neighborhood
key = (source_node, k)
if key in graph_index.khop_cache:
node_ids = graph_index.khop_cache[key]
else:
# Compute on-the-fly via BFS
node_ids = compute_khop_neighbors(source_node, k, graph_index.graph)
return IntermediateResult(
node_ids=node_ids.to_vec(),
symbolic_scores=[1.0 / (1 + distance)] for distance in range(len(node_ids))
)
case OnPath(source, target):
# Check path cache
path_nodes = graph_index.path_cache.get_path(source, target)
return IntermediateResult(
node_ids=path_nodes,
symbolic_scores=[1.0] * len(path_nodes)
)
function intersect_results(left: IntermediateResult, right: IntermediateResult) -> IntermediateResult:
"""
Set intersection (AND): keep nodes in both sets.
Use Roaring Bitmap for efficient intersection.
"""
left_bitmap = RoaringBitmap.from_sorted(left.node_ids)
right_bitmap = RoaringBitmap.from_sorted(right.node_ids)
intersection = left_bitmap & right_bitmap # Bitmap AND
# Combine scores (average for simplicity)
node_ids = intersection.to_vec()
combined_scores = []
for node_id in node_ids:
left_score = left.get_score(node_id)
right_score = right.get_score(node_id)
combined_scores.append((left_score + right_score) / 2.0)
return IntermediateResult(
node_ids=node_ids,
scores=combined_scores
)
function apply_hybrid_scoring(result_set, config: HybridScoringConfig) -> IntermediateResult:
"""
Combine neural and symbolic scores.
Formula: hybrid_score = α * normalize(neural) + β * normalize(symbolic)
"""
neural_scores = result_set.neural_scores
symbolic_scores = result_set.symbolic_scores
# Normalize scores to [0, 1]
if config.normalization == NormalizationMethod.MinMax:
neural_norm = min_max_normalize(neural_scores)
symbolic_norm = min_max_normalize(symbolic_scores)
elif config.normalization == NormalizationMethod.ZScore:
neural_norm = z_score_normalize(neural_scores)
symbolic_norm = z_score_normalize(symbolic_scores)
else:
neural_norm = neural_scores
symbolic_norm = symbolic_scores
# Combine with weights
alpha = config.neural_weight
beta = config.symbolic_weight
hybrid_scores = [
alpha * n + beta * s
for n, s in zip(neural_norm, symbolic_norm)
]
result_set.hybrid_scores = hybrid_scores
return result_set
2. Query Optimization
function optimize_query(query: Query, optimizer: QueryOptimizer) -> Query:
"""
Optimize query execution plan.
Key optimizations:
1. Predicate pushdown (filters into HNSW search)
2. Index selection (choose best index for each predicate)
3. Join reordering (cheapest predicates first)
4. Early termination (stop when enough candidates found)
"""
# Extract predicates from WHERE clause
predicates = extract_predicates(query.where_clause)
# Classify predicates
neural_preds = [p for p in predicates if is_neural_predicate(p)]
symbolic_preds = [p for p in predicates if is_symbolic_predicate(p)]
graph_preds = [p for p in predicates if is_graph_predicate(p)]
# Estimate selectivity for each predicate
selectivities = {}
for pred in predicates:
selectivities[pred] = estimate_selectivity(pred, optimizer.stats)
# Predicate pushdown: which filters can be applied during HNSW search?
inline_filters = []
post_filters = []
for pred in symbolic_preds:
if can_pushdown(pred):
inline_filters.append(pred)
else:
post_filters.append(pred)
# Index selection: choose best index for each symbolic predicate
index_plan = {}
for pred in symbolic_preds:
best_index = choose_best_index(pred, optimizer.indexes, selectivities[pred])
index_plan[pred] = best_index
# Reorder predicates: most selective first
ordered_predicates = sorted(predicates, key=lambda p: selectivities[p])
# Build optimized execution plan
optimized_query = rewrite_query(
query,
inline_filters=inline_filters,
post_filters=post_filters,
index_plan=index_plan,
predicate_order=ordered_predicates
)
return optimized_query
function estimate_selectivity(predicate, stats) -> float:
"""
Estimate fraction of nodes matching predicate.
Uses index statistics (histograms, cardinality, etc.)
"""
match predicate:
case VectorSimilarity(threshold):
# Estimate based on similarity distribution
return estimate_similarity_selectivity(threshold, stats.similarity_histogram)
case Attribute(field, operator, value):
# Estimate based on attribute distribution
if operator == ComparisonOp.Eq:
return 1.0 / stats.cardinality[field] # Uniform assumption
elif operator in [Lt, Le, Gt, Ge]:
return estimate_range_selectivity(field, operator, value, stats)
elif operator == In:
return len(value.list) / stats.cardinality[field]
case Graph(constraint):
# Estimate based on graph structure
match constraint:
case InCommunity(id):
return stats.community_sizes[id] / stats.total_nodes
case WithinKHops(node, k):
return estimate_khop_size(node, k, stats) / stats.total_nodes
function can_pushdown(predicate) -> bool:
"""
Check if predicate can be pushed into HNSW search.
Only simple equality/range predicates on indexed fields can be pushed down.
"""
match predicate:
case Attribute(field, operator, value):
# Can pushdown if operator is simple and field is indexed
return operator in [Eq, Lt, Le, Gt, Ge, In] and is_indexed(field)
case _:
return False # Complex predicates handled post-search
API Design (Function Signatures)
// File: crates/ruvector-query/src/neuro_symbolic/mod.rs
impl NeuroSymbolicEngine {
/// Create a new neuro-symbolic query engine
pub fn new(
hnsw_index: Arc<HnswIndex>,
metadata_path: impl AsRef<Path>,
) -> Result<Self, QueryError>;
/// Execute a query (SQL or Cypher syntax)
pub fn execute_query(
&self,
query: &str,
) -> Result<QueryResult, QueryError>;
/// Execute a parsed query (AST)
pub fn execute_parsed_query(
&self,
query: Query,
) -> Result<QueryResult, QueryError>;
/// Add metadata index for a field
pub fn create_index(
&mut self,
field: &str,
index_type: IndexType,
) -> Result<(), QueryError>;
/// Update hybrid scoring configuration
pub fn set_scoring_config(&mut self, config: HybridScoringConfig);
/// Get query execution statistics
pub fn stats(&self) -> QueryEngineStats;
}
#[derive(Debug, Clone, Copy)]
pub enum IndexType {
Inverted, // Categorical fields
BTree, // Range queries
Bitmap, // Set operations
}
impl Query {
/// Parse SQL query string into AST
pub fn parse_sql(query: &str) -> Result<Self, ParseError>;
/// Parse Cypher query string into AST
pub fn parse_cypher(query: &str) -> Result<Self, ParseError>;
/// Validate query syntax and semantics
pub fn validate(&self) -> Result<(), ValidationError>;
}
impl Predicate {
/// Evaluate predicate on a node
pub fn evaluate(
&self,
node_id: u32,
vector_store: &VectorStore,
metadata_store: &MetadataStore,
) -> bool;
/// Extract referenced fields
pub fn referenced_fields(&self) -> Vec<String>;
/// Check if predicate is neural (vector similarity)
pub fn is_neural(&self) -> bool;
/// Check if predicate is symbolic (metadata)
pub fn is_symbolic(&self) -> bool;
/// Check if predicate is graph-structural
pub fn is_graph_structural(&self) -> bool;
}
impl MetadataIndexes {
/// Create indexes from metadata file
pub fn from_metadata(path: impl AsRef<Path>) -> Result<Self, IndexError>;
/// Add inverted index for field
pub fn add_inverted_index(
&mut self,
field: &str,
values: HashMap<String, Vec<u32>>,
) -> Result<(), IndexError>;
/// Add B-tree index for field
pub fn add_btree_index(
&mut self,
field: &str,
values: Vec<(OrderedValue, u32)>,
) -> Result<(), IndexError>;
/// Query inverted index
pub fn query_inverted(
&self,
field: &str,
value: &str,
) -> Option<&RoaringBitmap>;
/// Query B-tree index (range)
pub fn query_btree_range(
&self,
field: &str,
operator: ComparisonOp,
value: OrderedValue,
) -> Option<RoaringBitmap>;
/// Intersect bitmaps (AND operation)
pub fn intersect(&self, bitmaps: &[RoaringBitmap]) -> RoaringBitmap;
/// Union bitmaps (OR operation)
pub fn union(&self, bitmaps: &[RoaringBitmap]) -> RoaringBitmap;
/// Difference bitmaps (NOT operation)
pub fn difference(&self, left: &RoaringBitmap, right: &RoaringBitmap) -> RoaringBitmap;
}
#[derive(Debug, Default)]
pub struct QueryEngineStats {
pub total_queries: u64,
pub avg_query_time_ms: f64,
pub cache_hit_rate: f64,
pub avg_candidates_evaluated: f64,
}
Integration Points
Affected Crates/Modules
-
ruvector-query(New Crate)- New module:
src/neuro_symbolic/mod.rs- Core engine - New module:
src/neuro_symbolic/parser.rs- SQL/Cypher parser - New module:
src/neuro_symbolic/optimizer.rs- Query optimizer - New module:
src/neuro_symbolic/planner.rs- Execution planner - New module:
src/neuro_symbolic/indexes.rs- Metadata indexing
- New module:
-
ruvector-core(Integration)- Modified:
src/index/hnsw.rs- Add filter callback support - Modified:
src/vector_store.rs- Expose metadata API
- Modified:
-
ruvector-api(Exposure)- Modified:
src/query.rs- Add neuro-symbolic query endpoint - New:
src/query/sql.rs- SQL query interface - New:
src/query/cypher.rs- Cypher query interface
- Modified:
-
ruvector-bindings(Language Bindings)- Modified:
python/src/lib.rs- Expose query API - Modified:
nodejs/src/lib.rs- Expose query API
- Modified:
New Modules to Create
crates/ruvector-query/ # New crate
├── src/
│ ├── neuro_symbolic/
│ │ ├── mod.rs # Core engine
│ │ ├── parser.rs # Query parsing
│ │ ├── optimizer.rs # Query optimization
│ │ ├── planner.rs # Execution planning
│ │ ├── executor.rs # Query execution
│ │ ├── indexes.rs # Metadata indexing
│ │ ├── scoring.rs # Hybrid scoring
│ │ └── stats.rs # Statistics collection
│ └── lib.rs
examples/
├── neuro_symbolic_queries/
│ ├── sql_examples.rs # SQL query examples
│ ├── cypher_examples.rs # Cypher query examples
│ ├── hybrid_scoring.rs # Hybrid scoring examples
│ └── README.md
Dependencies on Other Features
Depends On:
- HNSW Index: Core vector search functionality
- Existing Cypher Support: Extend existing graph query support
Synergies With:
- GNN-Guided Routing (Feature 1): Can use GNN for smarter query execution
- Incremental Learning (Feature 2): Real-time index updates support streaming queries
External Dependencies:
sqlparser- SQL parsingcypher-parser- Cypher parsing (if not already present)roaring- Roaring Bitmap for efficient set operationsserde- Query serialization
Regression Prevention
What Existing Functionality Could Break
-
Pure Vector Search Performance
- Risk: Adding metadata lookups slows down simple vector queries
- Impact: Regression in baseline HNSW performance
-
Memory Usage
- Risk: Metadata indexes consume excessive RAM
- Impact: OOM on large datasets
-
Query Correctness
- Risk: Filter pushdown logic has bugs, returns wrong results
- Impact: Incorrect search results
-
Cypher Compatibility
- Risk: Extending Cypher syntax breaks existing queries
- Impact: Breaking change for existing users
Test Cases to Prevent Regressions
// File: crates/ruvector-query/tests/neuro_symbolic_regression_tests.rs
#[test]
fn test_pure_vector_search_unchanged() {
// Simple vector queries should have zero overhead
let engine = setup_test_engine();
// Baseline: pure HNSW search (no filters)
let query_baseline = "SELECT * FROM vectors ORDER BY similarity(embedding, $query) DESC LIMIT 10";
let start = Instant::now();
let results = engine.execute_query(query_baseline).unwrap();
let time_with_engine = start.elapsed();
// Direct HNSW (without query engine)
let start = Instant::now();
let results_direct = engine.hnsw_index.search(&query_vector, 10).unwrap();
let time_direct = start.elapsed();
// Query engine should add <5% overhead
let overhead = (time_with_engine.as_secs_f64() / time_direct.as_secs_f64()) - 1.0;
assert!(overhead < 0.05, "Overhead: {:.2}%, expected <5%", overhead * 100.0);
// Results should be identical
assert_eq!(results.node_ids, results_direct.node_ids);
}
#[test]
fn test_filter_correctness() {
// Filtered queries must return correct subset
let engine = setup_test_engine_with_metadata();
let query = "SELECT * FROM vectors
WHERE similarity(embedding, $query) > 0.8
AND category = 'research'
AND year >= 2023
LIMIT 10";
let results = engine.execute_query(query).unwrap();
// Verify each result matches ALL predicates
for node_id in &results.node_ids {
let similarity = compute_similarity(&query_vector, engine.get_vector(*node_id));
assert!(similarity > 0.8, "Node {} similarity: {}, expected >0.8", node_id, similarity);
let category = engine.get_metadata(*node_id, "category");
assert_eq!(category, "research", "Node {} category: {}, expected 'research'", node_id, category);
let year = engine.get_metadata(*node_id, "year").parse::<i32>().unwrap();
assert!(year >= 2023, "Node {} year: {}, expected >=2023", node_id, year);
}
}
#[test]
fn test_filter_pushdown_performance() {
// Pushdown filters should be much faster than post-filtering
let engine = setup_test_engine_with_metadata();
// With pushdown (optimized)
let query_pushdown = "SELECT * FROM vectors
WHERE similarity(embedding, $query) > 0.8
AND category = 'research'
LIMIT 10";
let start = Instant::now();
let results_pushdown = engine.execute_query(query_pushdown).unwrap();
let time_pushdown = start.elapsed();
// Without pushdown (post-filter, manual implementation)
let all_results = engine.hnsw_index.search(&query_vector, 10000).unwrap();
let start = Instant::now();
let results_post: Vec<_> = all_results.into_iter()
.filter(|r| r.similarity > 0.8)
.filter(|r| engine.get_metadata(r.node_id, "category") == "research")
.take(10)
.collect();
let time_post = start.elapsed();
// Pushdown should be ≥5x faster
let speedup = time_post.as_secs_f64() / time_pushdown.as_secs_f64();
assert!(speedup >= 5.0, "Speedup: {:.1}x, expected ≥5x", speedup);
// Results should be identical
assert_eq!(results_pushdown.node_ids.len(), results_post.len());
}
#[test]
fn test_hybrid_scoring_correctness() {
// Hybrid scores should combine neural and symbolic correctly
let engine = setup_test_engine();
engine.set_scoring_config(HybridScoringConfig {
neural_weight: 0.7,
symbolic_weight: 0.3,
normalization: NormalizationMethod::MinMax,
});
let query = "SELECT * FROM vectors
WHERE similarity(embedding, $query) > 0.5
AND year >= 2020
ORDER BY hybrid_score DESC
LIMIT 10";
let results = engine.execute_query(query).unwrap();
// Verify hybrid score formula
for i in 0..results.node_ids.len() {
let neural = results.neural_scores[i];
let symbolic = results.symbolic_scores.as_ref().unwrap()[i];
// Normalize (min-max)
let neural_norm = (neural - 0.5) / (1.0 - 0.5); // Assuming min=0.5, max=1.0
let symbolic_norm = (symbolic - 0.0) / (1.0 - 0.0); // Assuming min=0.0, max=1.0
let expected_hybrid = 0.7 * neural_norm + 0.3 * symbolic_norm;
let actual_hybrid = results.hybrid_scores[i];
assert!((expected_hybrid - actual_hybrid).abs() < 1e-5,
"Hybrid score mismatch: expected {}, got {}", expected_hybrid, actual_hybrid);
}
}
#[test]
fn test_boolean_logic_correctness() {
// AND/OR/NOT operations must be correct
let engine = setup_test_engine();
// Test AND
let query_and = "SELECT * FROM vectors
WHERE category = 'A' AND tag = 'X'";
let results_and = engine.execute_query(query_and).unwrap();
for node_id in &results_and.node_ids {
assert_eq!(engine.get_metadata(*node_id, "category"), "A");
assert_eq!(engine.get_metadata(*node_id, "tag"), "X");
}
// Test OR
let query_or = "SELECT * FROM vectors
WHERE category = 'A' OR category = 'B'";
let results_or = engine.execute_query(query_or).unwrap();
for node_id in &results_or.node_ids {
let category = engine.get_metadata(*node_id, "category");
assert!(category == "A" || category == "B");
}
// Test NOT
let query_not = "SELECT * FROM vectors
WHERE category = 'A' AND NOT tag = 'X'";
let results_not = engine.execute_query(query_not).unwrap();
for node_id in &results_not.node_ids {
assert_eq!(engine.get_metadata(*node_id, "category"), "A");
assert_ne!(engine.get_metadata(*node_id, "tag"), "X");
}
}
Backward Compatibility Strategy
-
Opt-In Feature
- Neuro-symbolic queries are opt-in (require explicit SQL/Cypher syntax)
- Existing vector search API unchanged
-
Graceful Degradation
- If metadata indexes not available, fallback to post-filtering
- Log warning but do not crash
-
Configuration
query: neuro_symbolic: enabled: true # Default: true metadata_indexes: true # Default: true hybrid_scoring: true # Default: true -
API Versioning
- New endpoints for neuro-symbolic queries (
/query/sql,/query/cypher) - Existing endpoints (
/search) unchanged
- New endpoints for neuro-symbolic queries (
Implementation Phases
Phase 1: Core Infrastructure (Week 1-2)
Goal: Query parsing and basic execution
Tasks:
- Implement SQL/Cypher parser
- Build AST representation
- Implement basic query executor (no optimization)
- Unit tests for parsing and execution
Deliverables:
neuro_symbolic/parser.rsneuro_symbolic/executor.rs- Passing unit tests
Success Criteria:
- Can parse and execute simple queries (vector similarity only)
- Correct results (matches HNSW baseline)
Phase 2: Metadata Indexing (Week 2-3)
Goal: Support symbolic predicates
Tasks:
- Implement inverted index for categorical fields
- Implement B-tree index for range queries
- Integrate Roaring Bitmap for set operations
- Test index correctness and performance
Deliverables:
neuro_symbolic/indexes.rs- Index creation and query APIs
- Benchmark report
Success Criteria:
- Indexes correctly return matching nodes
- Index queries <10ms for typical workloads
- Memory overhead <20% of vector data
Phase 3: Filter Pushdown (Week 3-4)
Goal: Optimize query execution
Tasks:
- Implement filter pushdown into HNSW search
- Modify HNSW to support filter callbacks
- Benchmark speedup vs post-filtering
- Test correctness of pushdown logic
Deliverables:
- Modified
hnsw.rswith filter support neuro_symbolic/optimizer.rs- Performance benchmarks
Success Criteria:
- ≥5x speedup for filtered queries
- Zero correctness regressions
- Works with complex boolean logic (AND/OR/NOT)
Phase 4: Hybrid Scoring (Week 4-5)
Goal: Combine neural and symbolic scores
Tasks:
- Implement hybrid scoring algorithm
- Add score normalization methods
- Tune weights (α, β) for best results
- Test on real-world datasets
Deliverables:
neuro_symbolic/scoring.rs- Hybrid scoring benchmarks
- Configuration guide
Success Criteria:
- Hybrid queries improve relevance metrics (NDCG, MRR)
- Configurable weights work as expected
- Performance <20ms for typical queries
Phase 5: Production Hardening (Week 5-6)
Goal: Production-ready feature
Tasks:
- Add comprehensive error handling
- Write documentation and examples
- Stress testing (large datasets, complex queries)
- Integration with existing Cypher support
Deliverables:
- Full error handling
- User documentation
- Example queries
- Regression test suite
Success Criteria:
- Zero crashes in stress tests
- Documentation complete
- Ready for alpha release
Success Metrics
Performance Benchmarks
Primary Metrics (Must Achieve):
| Query Type | Baseline (Post-Filter) | Neuro-Symbolic | Target Improvement |
|---|---|---|---|
| Similarity + 1 filter | 50ms | 5ms | 10x faster |
| Similarity + 3 filters | 200ms | 8ms | 25x faster |
| Complex boolean (AND/OR/NOT) | N/A (manual) | 15ms | New capability |
| Multi-modal (vector + graph) | 500ms (manual joins) | 20ms | 25x faster |
Secondary Metrics:
| Metric | Target |
|---|---|
| Index memory overhead | <20% of vector data |
| Query parsing time | <1ms |
| Hybrid scoring overhead | <2ms |
| Concurrent query throughput | Same as baseline |
Accuracy Metrics
Relevance Improvement (on benchmark datasets):
- NDCG@10: +15% (hybrid scoring vs pure vector)
- MRR (Mean Reciprocal Rank): +20%
- Precision@10: +10%
Correctness:
- 100% of filtered results match all predicates
- Zero false positives or false negatives
Memory/Latency Targets
Memory:
- Inverted indexes: <100MB per 1M nodes (categorical fields)
- B-tree indexes: <50MB per 1M nodes (range fields)
- Total overhead: <20% of vector index size
Latency:
- Simple query (1 filter): <10ms
- Complex query (3+ filters): <20ms
- Hybrid scoring: <5ms overhead
- P99 latency: <50ms
Throughput:
- Concurrent queries: Same as baseline HNSW
- No lock contention on indexes
Risks and Mitigations
Technical Risks
Risk 1: Query Parser Complexity
Probability: Medium | Impact: Medium
Description: SQL/Cypher parsing is complex, could have bugs or performance issues.
Mitigation:
- Use established parsing libraries (
sqlparser,cypher-parser) - Extensive test suite with edge cases
- Validate AST before execution
- Provide query validation tool
Contingency: Start with simple query subset, expand incrementally.
Risk 2: Index Memory Overhead
Probability: High | Impact: Medium
Description: Metadata indexes could consume excessive memory on large datasets.
Mitigation:
- Use compressed indexes (Roaring Bitmap for sparse sets)
- Make indexing optional (user chooses which fields to index)
- Monitor memory usage in tests
- Provide index size estimation tool
Contingency: Support external indexes (e.g., SQLite) for low-memory environments.
Risk 3: Filter Pushdown Bugs
Probability: Medium | Impact: Critical
Description: Incorrect filter logic could return wrong results.
Mitigation:
- Extensive correctness testing (ground truth validation)
- Compare pushdown results vs post-filtering
- Add assertion checks in debug builds
- Fuzzing for edge cases
Contingency: Add "safe mode" that validates results against post-filtering.
Risk 4: Hybrid Scoring Tuning Difficulty
Probability: High | Impact: Low
Description: Users may struggle to tune α/β weights for hybrid scoring.
Mitigation:
- Provide automatic weight tuning (based on query logs)
- Document recommended defaults for common use cases
- Add visualization tools for score distributions
- Support A/B testing framework
Contingency: Default to pure neural scoring (α=1, β=0) if user unsure.
Risk 5: Cypher Integration Conflicts
Probability: Low | Impact: Medium
Description: Extending Cypher syntax could conflict with existing graph queries.
Mitigation:
- Careful syntax design (use reserved keywords)
- Version Cypher extensions separately
- Extensive compatibility testing
- Document syntax differences
Contingency: Use separate query language (e.g., extended SQL only) if conflicts arise.
Summary Risk Matrix
| Risk | Probability | Impact | Mitigation Priority |
|---|---|---|---|
| Query parser complexity | Medium | Medium | Medium |
| Index memory overhead | High | Medium | HIGH |
| Filter pushdown bugs | Medium | Critical | CRITICAL |
| Hybrid scoring tuning | High | Low | LOW |
| Cypher integration conflicts | Low | Medium | Medium |
Next Steps
- Prototype Phase 1: Build SQL parser and basic executor (1 week)
- Validate Queries: Test on simple queries, measure correctness (2 days)
- Add Metadata Indexes: Implement inverted + B-tree indexes (1 week)
- Benchmark Performance: Measure speedup vs post-filtering (3 days)
- Iterate: Optimize based on profiling (ongoing)
Key Decision Points:
- After Phase 1: Is query parsing fast enough? (<1ms target)
- After Phase 3: Does filter pushdown work correctly? (Zero regressions)
- After Phase 4: Does hybrid scoring improve relevance? (+10% NDCG required)
Go/No-Go Criteria:
- ✅ 5x+ speedup on filtered queries
- ✅ Zero correctness regressions
- ✅ Memory overhead <20%
- ✅ Improved relevance metrics