wifi-densepose/docs/research/gnn-v2/03-neuro-symbolic-query.md
# Neuro-Symbolic Query Execution - Implementation Plan
## Overview
### Problem Statement
Current vector search in ruvector is purely neural (similarity-based): given a query vector, find the k most similar vectors by cosine/Euclidean distance. However, real-world queries often involve **logical constraints** that pure vector similarity cannot express:
**Examples of Unsupported Queries:**
- "Find vectors similar to X **AND** published after 2023 **AND** tagged as 'research'"
- "Find vectors similar to X **OR** similar to Y, **EXCLUDING** category 'spam'"
- "Find vectors where `metadata.price < 100` **AND** similarity > 0.8"
- "Find vectors in graph community C **AND** within 2 hops of node N"
**Current Limitations:**
- No support for boolean logic (AND, OR, NOT)
- Cannot filter by metadata attributes
- Cannot combine vector similarity with graph structure
- Forces post-processing filtering (inefficient)
- No way to express complex multi-modal queries
**Performance Impact:**
- Retrieving 10,000 vectors then filtering to 10 wastes 99.9% of computation
- No index acceleration for metadata predicates
- Cannot push down filters to HNSW search
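The cost of post-filtering follows from simple arithmetic: to keep k results behind a filter of selectivity s, a post-filter pipeline must retrieve roughly k/s candidates. A minimal sketch (the uniform-match assumption and function name are ours, for illustration):

```python
def over_retrieval(k: int, selectivity: float) -> tuple[int, float]:
    """Candidates a post-filter pipeline must fetch to keep k results,
    and the fraction of that work the filter throws away.
    Assumes matching nodes are uniformly distributed among candidates."""
    fetched = round(k / selectivity)   # expected candidates to retrieve
    wasted = (fetched - k) / fetched   # fraction discarded after retrieval
    return fetched, wasted

# 10 final results behind a 0.1%-selective filter:
fetched, wasted = over_retrieval(10, 0.001)   # 10_000 fetched, 99.9% wasted
```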
### Proposed Solution
**Neuro-Symbolic Query Execution**: A hybrid query engine that combines neural vector similarity with symbolic logical constraints.
**Key Components:**
1. **Query Language**: Extend existing Cypher/SQL support with vector similarity operators
2. **Hybrid Scoring**: Combine vector similarity scores with predicate satisfaction
3. **Filter Pushdown**: Apply logical constraints during HNSW search (not after)
4. **Multi-Modal Indexing**: Index metadata attributes alongside vectors
5. **Constraint Propagation**: Use graph structure to prune search space
**Architecture:**
```
Query: "MATCH (v:Vector)
        WHERE vector_similarity(v.embedding, $query) > 0.8
          AND v.year >= 2023 AND v.category IN ['research', 'papers']
        RETURN v ORDER BY similarity DESC LIMIT 10"
                       ↓ Parse & Optimize
  Neural Component:                Symbolic Component:
  vector_similarity > 0.8          year >= 2023 AND category IN [...]
          ↓                                ↓
     HNSW Search                     Metadata Index
          ↓                                ↓
          └───────────── Merge ────────────┘
                       ↓
     Hybrid Scoring (α * neural + β * symbolic)
                       ↓
                 Top-K Results
```
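The merge step at the bottom of this flow can be sketched in a few lines of Python (min-max normalization and the α/β weighting follow the diagram; the function names are ours):

```python
def min_max(xs):
    """Normalize a list of scores to [0, 1]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [1.0 for _ in xs]   # degenerate case: all scores equal
    return [(x - lo) / (hi - lo) for x in xs]

def hybrid_scores(neural, symbolic, alpha=0.7, beta=0.3):
    """Combine per-candidate neural similarity and symbolic scores
    after normalizing each list independently."""
    return [alpha * n + beta * s
            for n, s in zip(min_max(neural), min_max(symbolic))]

# Three candidates: high-sim/high-symbolic, low/mid, highest-sim/low
scores = hybrid_scores([0.9, 0.8, 0.95], [1.0, 0.5, 0.0])
```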
### Expected Benefits
**Quantified Performance Improvements:**
| Query Type | Current (Post-Filter) | Neuro-Symbolic | Improvement |
|------------|----------------------|----------------|-------------|
| Similarity + 1 filter | 50ms (10K retrieved) | 5ms (100 retrieved) | **10x faster** |
| Similarity + 3 filters | 200ms (50K retrieved) | 8ms (200 retrieved) | **25x faster** |
| Complex boolean logic | Not supported | 15ms | **∞** (new capability) |
| Multi-modal query | Manual joins | 20ms | **50x faster** |
**Qualitative Benefits:**
- Express complex queries naturally (no manual post-processing)
- Efficient execution with filter pushdown
- Support for real-world use cases (e-commerce, research, RAG)
- Better accuracy through multi-modal fusion
- Graph-aware queries (community detection, path constraints)
## Technical Design
### Architecture Diagram (ASCII Art)
```
┌─────────────────────────────────────────────────────────────────┐
│ Neuro-Symbolic Query Execution Pipeline │
└─────────────────────────────────────────────────────────────────┘
User Query (SQL/Cypher + Vector Similarity)
│ Example: "SELECT * FROM vectors
│ WHERE cosine_similarity(embedding, $query) > 0.8
│ AND category = 'research' AND year >= 2023
│ ORDER BY similarity DESC LIMIT 10"
┌─────────────────────────────────────────────────────────────────┐
│ Query Parser & AST Builder │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Parse query into Abstract Syntax Tree (AST) │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ SELECT │ │ │
│ │ │ WHERE │ │ │
│ │ │ AND │ │ │
│ │ │ ├─ cosine_similarity(emb, $q) > 0.8 [NEURAL] │ │ │
│ │ │ ├─ category = 'research' [SYMBOLIC] │ │ │
│ │ │ └─ year >= 2023 [SYMBOLIC] │ │ │
│ │ │ ORDER BY similarity DESC │ │ │
│ │ │ LIMIT 10 │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Query Optimizer │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Analyze predicates and rewrite query for efficiency │ │
│ │ │ │
│ │ 1. Predicate Pushdown: │ │
│ │ Move filters into HNSW search (before candidate gen) │ │
│ │ │ │
│ │ 2. Index Selection: │ │
│ │ Choose best index for symbolic predicates │ │
│ │ - category: inverted index │ │
│ │ - year: range index (B-tree) │ │
│ │ │ │
│ │ 3. Execution Strategy: │ │
│ │ - If few categories: scan category index first │ │
│ │ - If similarity selective: HNSW first, then filter │ │
│ │ - If balanced: hybrid merge │ │
│ │ │ │
│ │ 4. Hybrid Scoring: │ │
│ │ score = α * neural_sim + β * symbolic_score │ │
│ └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Execution Plan │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Step 1: HNSW Search (neural) │ │
│ │ - Target: similarity > 0.8 │ │
│ │ - Candidate pool: ef=200 │ │
│ │ - Early termination: collect ~100 candidates │ │
│ │ - Filter during search: year >= 2023 │ │
│ │ Output: {node_id, similarity} for ~100 candidates │ │
│ │ │ │
│ │ Step 2: Symbolic Filtering (metadata index) │ │
│ │ - Lookup category index: category = 'research' │ │
│ │ - Intersect with HNSW candidates │ │
│ │ Output: {node_id, similarity, metadata} for ~30 nodes │ │
│ │ │ │
│ │ Step 3: Hybrid Scoring │ │
│ │ - Compute symbolic_score (e.g., recency bonus) │ │
│ │ - Combined: 0.7 * similarity + 0.3 * symbolic_score │ │
│ │ Output: {node_id, hybrid_score} │ │
│ │ │ │
│ │ Step 4: Top-K Selection │ │
│ │ - Sort by hybrid_score DESC │ │
│ │ - Return top 10 │ │
│ └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Result Set │
│ [{id: 42, similarity: 0.95, category: 'research', year: 2024}, │
│ {id: 137, similarity: 0.92, category: 'research', year: 2023},│
│ ...] │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Indexing & Storage Architecture │
└─────────────────────────────────────────────────────────────────┘
Vector Data:
┌─────────────────────────────────────────────────────────────────┐
│ HNSW Index (vector similarity) │
│ - Node ID → Embedding vector │
│ - Graph structure for approximate NN search │
└─────────────────────────────────────────────────────────────────┘
Metadata Data:
┌─────────────────────────────────────────────────────────────────┐
│ Inverted Index (categorical attributes) │
│ - category → {node_ids} │
│ - tag → {node_ids} │
│ - author → {node_ids} │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ B-Tree Index (range attributes) │
│ - year → sorted {node_ids} │
│ - price → sorted {node_ids} │
│ - timestamp → sorted {node_ids} │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Roaring Bitmap Index (set operations) │
│ - Efficient AND/OR/NOT on node ID sets │
│ - Compressed storage for sparse sets │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Graph Index (structural constraints) │
│ - Community membership: community_id → {node_ids} │
│ - k-hop neighborhoods: precomputed for common queries │
│ - Path constraints: shortest path caches │
└─────────────────────────────────────────────────────────────────┘
```
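The inverted-index and range-index lookups above can be illustrated with plain Python containers, where sets and a bisect-sorted list stand in for the roaring bitmaps and the B-tree (class and method names are illustrative, not the real API):

```python
from bisect import bisect_left

class ToyMetadataIndex:
    """Toy metadata index: dict-of-sets inverted index plus a sorted-list
    range index, standing in for the bitmap/B-tree structures above."""
    def __init__(self, rows):                 # rows: {node_id: {field: value}}
        self.inverted = {}                    # (field, value) -> {node_ids}
        self.by_field = {}                    # field -> sorted [(value, node_id)]
        for nid, attrs in rows.items():
            for field, value in attrs.items():
                self.inverted.setdefault((field, value), set()).add(nid)
                self.by_field.setdefault(field, []).append((value, nid))
        for pairs in self.by_field.values():
            pairs.sort()

    def eq(self, field, value):               # inverted-index lookup
        return self.inverted.get((field, value), set())

    def ge(self, field, value):               # range lookup: value >= bound
        pairs = self.by_field.get(field, [])
        return {nid for _, nid in pairs[bisect_left(pairs, (value,)):]}

rows = {
    1: {"category": "research", "year": 2022},
    2: {"category": "research", "year": 2024},
    3: {"category": "spam", "year": 2024},
}
idx = ToyMetadataIndex(rows)
# AND is set intersection, mirroring the roaring-bitmap AND:
hits = idx.eq("category", "research") & idx.ge("year", 2023)   # {2}
```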
### Core Data Structures (Rust)
```rust
// File: crates/ruvector-query/src/neuro_symbolic/mod.rs
use std::collections::{BTreeMap, HashMap, HashSet};
use std::sync::Arc;

use ordered_float::OrderedFloat;
use roaring::RoaringBitmap;
use serde::{Deserialize, Serialize};

/// Neuro-symbolic query execution engine
pub struct NeuroSymbolicEngine {
    /// HNSW index for vector similarity
    hnsw_index: Arc<HnswIndex>,
    /// Metadata indexes (inverted, B-tree, etc.)
    metadata_indexes: MetadataIndexes,
    /// Query optimizer
    optimizer: QueryOptimizer,
    /// Execution planner
    planner: ExecutionPlanner,
    /// Hybrid scoring configuration
    scoring_config: HybridScoringConfig,
}

/// Query representation (SQL/Cypher AST)
#[derive(Debug, Clone)]
pub struct Query {
    /// SELECT clause (which fields to return)
    pub select: Vec<String>,
    /// WHERE clause (predicates)
    pub where_clause: Option<Predicate>,
    /// ORDER BY clause
    pub order_by: Vec<OrderBy>,
    /// LIMIT clause
    pub limit: Option<usize>,
    /// OFFSET clause
    pub offset: Option<usize>,
}

/// Predicate tree (boolean logic)
#[derive(Debug, Clone)]
pub enum Predicate {
    /// Neural predicate: vector similarity
    VectorSimilarity {
        field: String,
        query_vector: Vec<f32>,
        operator: ComparisonOp,   // >, <, =
        threshold: f32,
        metric: SimilarityMetric, // cosine, euclidean, dot
    },
    /// Symbolic predicate: metadata constraint
    Attribute {
        field: String,
        operator: ComparisonOp,
        value: Value,
    },
    /// Graph predicate: structural constraint
    Graph {
        constraint: GraphConstraint,
    },
    /// Boolean operators
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
    Not(Box<Predicate>),
}

#[derive(Debug, Clone)]
pub enum GraphConstraint {
    /// Node in community
    InCommunity { community_id: u32 },
    /// Within k hops of node
    WithinKHops { source_node: u32, k: usize },
    /// On path between two nodes
    OnPath { source: u32, target: u32 },
    /// Has edge to node
    ConnectedTo { node_id: u32 },
}

#[derive(Debug, Clone, Copy)]
pub enum ComparisonOp {
    Eq,   // =
    Ne,   // !=
    Lt,   // <
    Le,   // <=
    Gt,   // >
    Ge,   // >=
    In,   // IN (...)
    Like, // LIKE (string pattern)
}

#[derive(Debug, Clone)]
pub enum Value {
    Int(i64),
    Float(f64),
    String(String),
    Bool(bool),
    List(Vec<Value>),
}

#[derive(Debug, Clone, Copy)]
pub enum SimilarityMetric {
    Cosine,
    Euclidean,
    DotProduct,
    L1,
}

/// Metadata indexing structures
pub struct MetadataIndexes {
    /// Inverted indexes for categorical fields
    inverted: HashMap<String, InvertedIndex>,
    /// B-tree indexes for range queries
    btree: HashMap<String, BTreeIndex>,
    /// Roaring bitmap for set operations
    bitmap_store: BitmapStore,
    /// Graph structural indexes
    graph_index: GraphStructureIndex,
}

/// Inverted index: field_value → {node_ids}
pub struct InvertedIndex {
    /// Map from value to posting list (node IDs)
    postings: HashMap<String, RoaringBitmap>,
    /// Statistics for query optimization
    stats: IndexStats,
}

/// B-tree index for range queries
pub struct BTreeIndex {
    /// Sorted map from value to node IDs
    tree: BTreeMap<OrderedValue, RoaringBitmap>,
    /// Statistics
    stats: IndexStats,
}

/// Roaring bitmap store for efficient set operations
pub struct BitmapStore {
    /// Node ID sets as compressed bitmaps
    bitmaps: HashMap<String, RoaringBitmap>,
}

/// Graph structure indexes
pub struct GraphStructureIndex {
    /// Community assignments
    communities: HashMap<u32, RoaringBitmap>,
    /// k-hop neighborhoods (precomputed)
    khop_cache: HashMap<(u32, usize), RoaringBitmap>,
    /// Shortest path cache
    path_cache: PathCache,
}

#[derive(Debug, Default)]
pub struct IndexStats {
    pub num_unique_values: usize,
    pub total_postings: usize,
    pub avg_posting_length: f64,
    pub selectivity: f64, // fraction of nodes matching
}

/// Query execution plan
#[derive(Debug)]
pub struct ExecutionPlan {
    /// Ordered steps to execute
    pub steps: Vec<ExecutionStep>,
    /// Estimated cost
    pub estimated_cost: f64,
    /// Estimated result size
    pub estimated_results: usize,
}

#[derive(Debug)]
pub enum ExecutionStep {
    /// HNSW vector search
    VectorSearch {
        query_vector: Vec<f32>,
        similarity_threshold: f32,
        metric: SimilarityMetric,
        ef: usize,
        filters: Vec<InlineFilter>, // Filters applied during search
    },
    /// Metadata index lookup
    IndexScan {
        index_name: String,
        predicate: Predicate,
    },
    /// Graph structure traversal
    GraphTraversal {
        constraint: GraphConstraint,
    },
    /// Set intersection (AND)
    Intersect {
        left: Box<ExecutionStep>,
        right: Box<ExecutionStep>,
    },
    /// Set union (OR)
    Union {
        left: Box<ExecutionStep>,
        right: Box<ExecutionStep>,
    },
    /// Set difference (NOT)
    Difference {
        left: Box<ExecutionStep>,
        right: Box<ExecutionStep>,
    },
    /// Hybrid scoring
    HybridScore {
        neural_scores: HashMap<u32, f32>,
        symbolic_scores: HashMap<u32, f32>,
        alpha: f32, // neural weight
        beta: f32,  // symbolic weight
    },
    /// Top-K selection
    TopK {
        input: Box<ExecutionStep>,
        k: usize,
        order_by: Vec<OrderBy>,
    },
}

/// Filter applied during HNSW search (pushdown)
#[derive(Debug, Clone)]
pub struct InlineFilter {
    pub field: String,
    pub operator: ComparisonOp,
    pub value: Value,
}

/// Hybrid scoring configuration
#[derive(Debug, Clone)]
pub struct HybridScoringConfig {
    /// Weight for neural similarity score
    pub neural_weight: f32,
    /// Weight for symbolic score
    pub symbolic_weight: f32,
    /// Normalization method
    pub normalization: NormalizationMethod,
}

#[derive(Debug, Clone, Copy)]
pub enum NormalizationMethod {
    /// Min-max normalization [0, 1]
    MinMax,
    /// Z-score normalization
    ZScore,
    /// None (assume scores already normalized)
    None,
}

/// Query result
#[derive(Debug, Serialize, Deserialize)]
pub struct QueryResult {
    /// Matched node IDs
    pub node_ids: Vec<u32>,
    /// Neural similarity scores
    pub neural_scores: Vec<f32>,
    /// Symbolic scores (if applicable)
    pub symbolic_scores: Option<Vec<f32>>,
    /// Hybrid scores
    pub hybrid_scores: Vec<f32>,
    /// Metadata for each result
    pub metadata: Vec<HashMap<String, Value>>,
    /// Query execution statistics
    pub stats: QueryStats,
}

#[derive(Debug, Serialize, Deserialize, Default)]
pub struct QueryStats {
    /// Total execution time (milliseconds)
    pub total_time_ms: f64,
    /// Time breakdown by step
    pub step_times: Vec<(String, f64)>,
    /// Number of candidates evaluated
    pub candidates_evaluated: usize,
    /// Number of results returned
    pub results_returned: usize,
    /// Index usage
    pub indexes_used: Vec<String>,
}

#[derive(Debug, Clone)]
pub struct OrderBy {
    pub field: String,
    pub direction: SortDirection,
}

#[derive(Debug, Clone, Copy)]
pub enum SortDirection {
    Asc,
    Desc,
}

/// Wrapper for ordered values in B-tree
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum OrderedValue {
    Int(i64),
    Float(OrderedFloat<f64>),
    String(String),
}
```
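A Python stand-in for the `Predicate` tree above shows how the boolean operators recurse over neural and symbolic leaves (the dataclass names mirror the Rust enum; `cosine` and the operator table are our own helpers for the sketch):

```python
from dataclasses import dataclass
from math import sqrt

# Minimal mirror of the Predicate enum (Python stand-in, not the Rust API)
@dataclass
class Attr:
    field: str
    op: str
    value: object

@dataclass
class Sim:
    query: list
    threshold: float

@dataclass
class And:
    left: object
    right: object

@dataclass
class Or:
    left: object
    right: object

@dataclass
class Not:
    inner: object

OPS = {"=": lambda a, b: a == b, ">=": lambda a, b: a >= b,
       "<": lambda a, b: a < b, "in": lambda a, b: a in b}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def evaluate(pred, vector, meta):
    """Recursively evaluate a predicate tree against one node's
    embedding (`vector`) and metadata dict (`meta`)."""
    if isinstance(pred, Sim):
        return cosine(vector, pred.query) >= pred.threshold   # neural leaf
    if isinstance(pred, Attr):
        return OPS[pred.op](meta[pred.field], pred.value)     # symbolic leaf
    if isinstance(pred, And):
        return evaluate(pred.left, vector, meta) and evaluate(pred.right, vector, meta)
    if isinstance(pred, Or):
        return evaluate(pred.left, vector, meta) or evaluate(pred.right, vector, meta)
    if isinstance(pred, Not):
        return not evaluate(pred.inner, vector, meta)
    raise TypeError(f"unknown predicate: {pred!r}")

# "similar to [1, 0] AND category = 'research' AND year >= 2023"
pred = And(Sim([1.0, 0.0], 0.8),
           And(Attr("category", "=", "research"), Attr("year", ">=", 2023)))
```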
### Key Algorithms (Pseudocode)
#### 1. Query Execution Algorithm
```python
function execute_neuro_symbolic_query(query: Query, engine: NeuroSymbolicEngine) -> QueryResult:
    """
    Execute neuro-symbolic query with hybrid scoring.
    Main algorithm: parse → optimize → plan → execute → score → return
    """
    start_time = now()

    # Step 1: Parse query into AST (already done, query is AST)

    # Step 2: Optimize query (predicate pushdown, index selection)
    optimized_query = engine.optimizer.optimize(query)

    # Step 3: Generate execution plan
    plan = engine.planner.create_plan(optimized_query)

    # Step 4: Execute plan steps
    result_set = execute_plan(plan, engine)

    # Step 5: Hybrid scoring
    if has_both_neural_and_symbolic(plan):
        result_set = apply_hybrid_scoring(result_set, engine.scoring_config)

    # Step 6: Apply ORDER BY and LIMIT
    result_set = sort_and_limit(result_set, query.order_by, query.limit, query.offset)

    # Step 7: Fetch metadata for results
    metadata = fetch_metadata(result_set.node_ids, query.select)

    execution_time = now() - start_time
    return QueryResult(
        node_ids=result_set.node_ids,
        neural_scores=result_set.neural_scores,
        symbolic_scores=result_set.symbolic_scores,
        hybrid_scores=result_set.hybrid_scores,
        metadata=metadata,
        stats=QueryStats(
            total_time_ms=execution_time,
            candidates_evaluated=result_set.candidates_evaluated,
            results_returned=len(result_set.node_ids),
            indexes_used=plan.indexes_used,
        ),
    )


function execute_plan(plan: ExecutionPlan, engine: NeuroSymbolicEngine) -> IntermediateResult:
    """
    Execute plan steps in order; nested steps recurse through execute_step.
    """
    results = None
    for step in plan.steps:
        results = execute_step(step, engine)
    return results


function execute_step(step: ExecutionStep, engine: NeuroSymbolicEngine) -> IntermediateResult:
    match step:
        case VectorSearch:
            # HNSW search with optional filters
            return execute_vector_search(step, engine.hnsw_index)
        case IndexScan:
            # Lookup in metadata index
            return execute_index_scan(step, engine.metadata_indexes)
        case GraphTraversal:
            # Graph structure query
            return execute_graph_traversal(step, engine.metadata_indexes.graph_index)
        case Intersect:
            # AND: set intersection
            left = execute_step(step.left, engine)
            right = execute_step(step.right, engine)
            return intersect_results(left, right)
        case Union:
            # OR: set union
            left = execute_step(step.left, engine)
            right = execute_step(step.right, engine)
            return union_results(left, right)
        case Difference:
            # NOT: set difference
            left = execute_step(step.left, engine)
            right = execute_step(step.right, engine)
            return difference_results(left, right)
        case HybridScore:
            # Compute hybrid scores
            return compute_hybrid_scores(step.neural_scores, step.symbolic_scores,
                                         step.alpha, step.beta)
        case TopK:
            # Select top-k results
            input_results = execute_step(step.input, engine)
            return select_top_k(input_results, step.k, step.order_by)


function execute_vector_search(step: VectorSearch, hnsw: HnswIndex) -> IntermediateResult:
    """
    HNSW search with filter pushdown.
    Key optimization: Apply symbolic filters during HNSW traversal
    to avoid generating candidates that will be filtered out anyway.
    """
    query_vector = step.query_vector
    similarity_threshold = step.similarity_threshold
    ef = step.ef
    inline_filters = step.filters

    # HNSW search with inline filtering
    candidates = []
    visited = set()

    # Start from entry point
    current_node = hnsw.entry_point
    layer = hnsw.max_layer

    while layer >= 0:
        # Greedy search at this layer
        while True:
            neighbors = hnsw.get_neighbors(current_node, layer)
            best_neighbor = None
            best_distance = float('inf')

            for neighbor in neighbors:
                if neighbor in visited:
                    continue

                # Apply inline filters BEFORE computing distance
                if not passes_inline_filters(neighbor, inline_filters, hnsw.metadata):
                    continue  # Skip this neighbor entirely

                # Compute distance only for filtered candidates
                distance = compute_distance(query_vector, hnsw.get_vector(neighbor))
                similarity = distance_to_similarity(distance, step.metric)

                if similarity >= similarity_threshold:
                    candidates.append((neighbor, similarity))

                if distance < best_distance:
                    best_distance = distance
                    best_neighbor = neighbor

                visited.add(neighbor)

            if best_neighbor is None:
                break  # No improvement
            current_node = best_neighbor
        layer -= 1

    # Sort candidates by similarity
    candidates.sort(key=lambda x: x[1], reverse=True)

    return IntermediateResult(
        node_ids=[node_id for node_id, _ in candidates],
        neural_scores=[score for _, score in candidates],
        candidates_evaluated=len(visited),
    )


function passes_inline_filters(node_id: u32, filters: List[InlineFilter], metadata: MetadataStore) -> bool:
    """
    Check if node passes all inline filters.
    This avoids computing distance for nodes that fail metadata constraints.
    """
    for filter in filters:
        node_value = metadata.get(node_id, filter.field)
        if not evaluate_predicate(node_value, filter.operator, filter.value):
            return False  # Failed a filter
    return True  # Passed all filters


function execute_index_scan(step: IndexScan, indexes: MetadataIndexes) -> IntermediateResult:
    """
    Scan metadata index to get matching node IDs.
    """
    match step.predicate:
        case Attribute(field, operator, value):
            if operator == ComparisonOp.Eq:
                # Exact match: use inverted index
                posting_list = indexes.inverted[field].lookup(value)
                return IntermediateResult(
                    node_ids=posting_list.to_vec(),
                    symbolic_scores=[1.0] * len(posting_list),  # Binary: matches or not
                )
            elif operator in [ComparisonOp.Lt, ComparisonOp.Le, ComparisonOp.Gt, ComparisonOp.Ge]:
                # Range query: use B-tree index
                matching_nodes = indexes.btree[field].range_query(operator, value)
                return IntermediateResult(
                    node_ids=matching_nodes.to_vec(),
                    symbolic_scores=[1.0] * len(matching_nodes),
                )
            elif operator == ComparisonOp.In:
                # IN query: union of inverted index lookups
                all_nodes = RoaringBitmap()
                for v in value.list:
                    posting_list = indexes.inverted[field].lookup(v)
                    all_nodes |= posting_list  # Union
                return IntermediateResult(
                    node_ids=all_nodes.to_vec(),
                    symbolic_scores=[1.0] * len(all_nodes),
                )


function execute_graph_traversal(step: GraphTraversal, graph_index: GraphStructureIndex) -> IntermediateResult:
    """
    Execute graph structural constraint.
    """
    match step.constraint:
        case InCommunity(community_id):
            # Lookup precomputed community membership
            node_ids = graph_index.communities.get(community_id)
            return IntermediateResult(
                node_ids=node_ids.to_vec(),
                symbolic_scores=[1.0] * len(node_ids),
            )
        case WithinKHops(source_node, k):
            # Lookup precomputed k-hop neighborhood
            key = (source_node, k)
            if key in graph_index.khop_cache:
                node_ids = graph_index.khop_cache[key]
            else:
                # Compute on-the-fly via BFS
                node_ids = compute_khop_neighbors(source_node, k, graph_index.graph)
            return IntermediateResult(
                node_ids=node_ids.to_vec(),
                symbolic_scores=[1.0 / (1 + distance) for distance in range(len(node_ids))],
            )
        case OnPath(source, target):
            # Check path cache
            path_nodes = graph_index.path_cache.get_path(source, target)
            return IntermediateResult(
                node_ids=path_nodes,
                symbolic_scores=[1.0] * len(path_nodes),
            )


function intersect_results(left: IntermediateResult, right: IntermediateResult) -> IntermediateResult:
    """
    Set intersection (AND): keep nodes in both sets.
    Use Roaring Bitmap for efficient intersection.
    """
    left_bitmap = RoaringBitmap.from_sorted(left.node_ids)
    right_bitmap = RoaringBitmap.from_sorted(right.node_ids)
    intersection = left_bitmap & right_bitmap  # Bitmap AND

    # Combine scores (average for simplicity)
    node_ids = intersection.to_vec()
    combined_scores = []
    for node_id in node_ids:
        left_score = left.get_score(node_id)
        right_score = right.get_score(node_id)
        combined_scores.append((left_score + right_score) / 2.0)

    return IntermediateResult(node_ids=node_ids, scores=combined_scores)


function apply_hybrid_scoring(result_set, config: HybridScoringConfig) -> IntermediateResult:
    """
    Combine neural and symbolic scores.
    Formula: hybrid_score = α * normalize(neural) + β * normalize(symbolic)
    """
    neural_scores = result_set.neural_scores
    symbolic_scores = result_set.symbolic_scores

    # Normalize scores to [0, 1]
    if config.normalization == NormalizationMethod.MinMax:
        neural_norm = min_max_normalize(neural_scores)
        symbolic_norm = min_max_normalize(symbolic_scores)
    elif config.normalization == NormalizationMethod.ZScore:
        neural_norm = z_score_normalize(neural_scores)
        symbolic_norm = z_score_normalize(symbolic_scores)
    else:
        neural_norm = neural_scores
        symbolic_norm = symbolic_scores

    # Combine with weights
    alpha = config.neural_weight
    beta = config.symbolic_weight
    hybrid_scores = [alpha * n + beta * s for n, s in zip(neural_norm, symbolic_norm)]

    result_set.hybrid_scores = hybrid_scores
    return result_set
```
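The pushdown benefit in `execute_vector_search` is easiest to see in a brute-force stand-in: check the metadata filters before paying for a distance computation, and count how many distance evaluations were avoided. All names here are illustrative, and a dot product stands in for the HNSW metric:

```python
def filtered_search(query, store, filters, threshold):
    """Scan candidates, applying metadata filters *before* the distance
    computation; return hits sorted by similarity plus the number of
    distance evaluations actually performed."""
    results, distances_computed = [], 0
    for nid, (vec, meta) in store.items():
        if not all(f(meta) for f in filters):
            continue                                  # pruned for free
        distances_computed += 1
        sim = sum(a * b for a, b in zip(query, vec))  # dot-product similarity
        if sim >= threshold:
            results.append((nid, sim))
    results.sort(key=lambda t: t[1], reverse=True)
    return results, distances_computed

store = {
    1: ([1.0, 0.0], {"year": 2024}),
    2: ([0.9, 0.1], {"year": 2022}),   # filtered out before any distance work
    3: ([0.8, 0.0], {"year": 2023}),
    4: ([0.1, 0.9], {"year": 2024}),   # passes filter but below threshold
}
hits, n_dist = filtered_search([1.0, 0.0], store,
                               [lambda m: m["year"] >= 2023], 0.5)
# hits == [(1, 1.0), (3, 0.8)]; only 3 of 4 distances were computed
```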
#### 2. Query Optimization
```python
function optimize_query(query: Query, optimizer: QueryOptimizer) -> Query:
    """
    Optimize query execution plan.
    Key optimizations:
    1. Predicate pushdown (filters into HNSW search)
    2. Index selection (choose best index for each predicate)
    3. Join reordering (cheapest predicates first)
    4. Early termination (stop when enough candidates found)
    """
    # Extract predicates from WHERE clause
    predicates = extract_predicates(query.where_clause)

    # Classify predicates
    neural_preds = [p for p in predicates if is_neural_predicate(p)]
    symbolic_preds = [p for p in predicates if is_symbolic_predicate(p)]
    graph_preds = [p for p in predicates if is_graph_predicate(p)]

    # Estimate selectivity for each predicate
    selectivities = {}
    for pred in predicates:
        selectivities[pred] = estimate_selectivity(pred, optimizer.stats)

    # Predicate pushdown: which filters can be applied during HNSW search?
    inline_filters = []
    post_filters = []
    for pred in symbolic_preds:
        if can_pushdown(pred):
            inline_filters.append(pred)
        else:
            post_filters.append(pred)

    # Index selection: choose best index for each symbolic predicate
    index_plan = {}
    for pred in symbolic_preds:
        best_index = choose_best_index(pred, optimizer.indexes, selectivities[pred])
        index_plan[pred] = best_index

    # Reorder predicates: most selective first
    ordered_predicates = sorted(predicates, key=lambda p: selectivities[p])

    # Build optimized execution plan
    return rewrite_query(
        query,
        inline_filters=inline_filters,
        post_filters=post_filters,
        index_plan=index_plan,
        predicate_order=ordered_predicates,
    )


function estimate_selectivity(predicate, stats) -> float:
    """
    Estimate fraction of nodes matching predicate.
    Uses index statistics (histograms, cardinality, etc.)
    """
    match predicate:
        case VectorSimilarity(threshold):
            # Estimate based on similarity distribution
            return estimate_similarity_selectivity(threshold, stats.similarity_histogram)
        case Attribute(field, operator, value):
            # Estimate based on attribute distribution
            if operator == ComparisonOp.Eq:
                return 1.0 / stats.cardinality[field]  # Uniform assumption
            elif operator in [ComparisonOp.Lt, ComparisonOp.Le, ComparisonOp.Gt, ComparisonOp.Ge]:
                return estimate_range_selectivity(field, operator, value, stats)
            elif operator == ComparisonOp.In:
                return len(value.list) / stats.cardinality[field]
        case Graph(constraint):
            # Estimate based on graph structure
            match constraint:
                case InCommunity(id):
                    return stats.community_sizes[id] / stats.total_nodes
                case WithinKHops(node, k):
                    return estimate_khop_size(node, k, stats) / stats.total_nodes


function can_pushdown(predicate) -> bool:
    """
    Check if predicate can be pushed into HNSW search.
    Only simple equality/range predicates on indexed fields can be pushed down.
    """
    match predicate:
        case Attribute(field, operator, value):
            # Can pushdown if operator is simple and field is indexed
            return operator in [ComparisonOp.Eq, ComparisonOp.Lt, ComparisonOp.Le,
                                ComparisonOp.Gt, ComparisonOp.Ge, ComparisonOp.In] \
                   and is_indexed(field)
        case _:
            return False  # Complex predicates handled post-search
```
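A toy version of the selectivity estimator and the most-selective-first ordering (the `stats` dictionary layout and tuple predicate encoding are invented for this sketch):

```python
def estimate_selectivity(pred, stats):
    """Fraction of nodes expected to match, mirroring the rules above.
    Predicates are (kind, field, payload) tuples for brevity."""
    kind, field, payload = pred
    if kind == "eq":                     # uniform-distribution assumption
        return 1.0 / stats["cardinality"][field]
    if kind == "in":
        return len(payload) / stats["cardinality"][field]
    if kind == "range":                  # pretend a histogram gave us a count
        return stats["range_matches"][field] / stats["total_nodes"]
    return 1.0                           # unknown predicate: assume no pruning

def order_predicates(preds, stats):
    """Cheapest-first ordering: most selective predicates run first."""
    return sorted(preds, key=lambda p: estimate_selectivity(p, stats))

stats = {
    "cardinality": {"category": 100, "tag": 4},
    "range_matches": {"year": 5000},
    "total_nodes": 10_000,
}
preds = [("range", "year", None),
         ("eq", "category", "research"),
         ("in", "tag", ["a", "b"])]
ordered = order_predicates(preds, stats)   # category equality (1%) runs first
```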
### API Design (Function Signatures)
```rust
// File: crates/ruvector-query/src/neuro_symbolic/mod.rs
impl NeuroSymbolicEngine {
    /// Create a new neuro-symbolic query engine
    pub fn new(
        hnsw_index: Arc<HnswIndex>,
        metadata_path: impl AsRef<Path>,
    ) -> Result<Self, QueryError>;

    /// Execute a query (SQL or Cypher syntax)
    pub fn execute_query(&self, query: &str) -> Result<QueryResult, QueryError>;

    /// Execute a parsed query (AST)
    pub fn execute_parsed_query(&self, query: Query) -> Result<QueryResult, QueryError>;

    /// Add metadata index for a field
    pub fn create_index(&mut self, field: &str, index_type: IndexType) -> Result<(), QueryError>;

    /// Update hybrid scoring configuration
    pub fn set_scoring_config(&mut self, config: HybridScoringConfig);

    /// Get query execution statistics
    pub fn stats(&self) -> QueryEngineStats;
}

#[derive(Debug, Clone, Copy)]
pub enum IndexType {
    Inverted, // Categorical fields
    BTree,    // Range queries
    Bitmap,   // Set operations
}

impl Query {
    /// Parse SQL query string into AST
    pub fn parse_sql(query: &str) -> Result<Self, ParseError>;

    /// Parse Cypher query string into AST
    pub fn parse_cypher(query: &str) -> Result<Self, ParseError>;

    /// Validate query syntax and semantics
    pub fn validate(&self) -> Result<(), ValidationError>;
}

impl Predicate {
    /// Evaluate predicate on a node
    pub fn evaluate(
        &self,
        node_id: u32,
        vector_store: &VectorStore,
        metadata_store: &MetadataStore,
    ) -> bool;

    /// Extract referenced fields
    pub fn referenced_fields(&self) -> Vec<String>;

    /// Check if predicate is neural (vector similarity)
    pub fn is_neural(&self) -> bool;

    /// Check if predicate is symbolic (metadata)
    pub fn is_symbolic(&self) -> bool;

    /// Check if predicate is graph-structural
    pub fn is_graph_structural(&self) -> bool;
}

impl MetadataIndexes {
    /// Create indexes from metadata file
    pub fn from_metadata(path: impl AsRef<Path>) -> Result<Self, IndexError>;

    /// Add inverted index for field
    pub fn add_inverted_index(
        &mut self,
        field: &str,
        values: HashMap<String, Vec<u32>>,
    ) -> Result<(), IndexError>;

    /// Add B-tree index for field
    pub fn add_btree_index(
        &mut self,
        field: &str,
        values: Vec<(OrderedValue, u32)>,
    ) -> Result<(), IndexError>;

    /// Query inverted index
    pub fn query_inverted(&self, field: &str, value: &str) -> Option<&RoaringBitmap>;

    /// Query B-tree index (range)
    pub fn query_btree_range(
        &self,
        field: &str,
        operator: ComparisonOp,
        value: OrderedValue,
    ) -> Option<RoaringBitmap>;

    /// Intersect bitmaps (AND operation)
    pub fn intersect(&self, bitmaps: &[RoaringBitmap]) -> RoaringBitmap;

    /// Union bitmaps (OR operation)
    pub fn union(&self, bitmaps: &[RoaringBitmap]) -> RoaringBitmap;

    /// Difference bitmaps (NOT operation)
    pub fn difference(&self, left: &RoaringBitmap, right: &RoaringBitmap) -> RoaringBitmap;
}

#[derive(Debug, Default)]
pub struct QueryEngineStats {
    pub total_queries: u64,
    pub avg_query_time_ms: f64,
    pub cache_hit_rate: f64,
    pub avg_candidates_evaluated: f64,
}
```
## Integration Points
### Affected Crates/Modules
1. **`ruvector-query`** (New Crate)
- New module: `src/neuro_symbolic/mod.rs` - Core engine
- New module: `src/neuro_symbolic/parser.rs` - SQL/Cypher parser
- New module: `src/neuro_symbolic/optimizer.rs` - Query optimizer
- New module: `src/neuro_symbolic/planner.rs` - Execution planner
- New module: `src/neuro_symbolic/indexes.rs` - Metadata indexing
2. **`ruvector-core`** (Integration)
- Modified: `src/index/hnsw.rs` - Add filter callback support
- Modified: `src/vector_store.rs` - Expose metadata API
3. **`ruvector-api`** (Exposure)
- Modified: `src/query.rs` - Add neuro-symbolic query endpoint
- New: `src/query/sql.rs` - SQL query interface
- New: `src/query/cypher.rs` - Cypher query interface
4. **`ruvector-bindings`** (Language Bindings)
- Modified: `python/src/lib.rs` - Expose query API
- Modified: `nodejs/src/lib.rs` - Expose query API
### New Modules to Create
```
crates/ruvector-query/ # New crate
├── src/
│ ├── neuro_symbolic/
│ │ ├── mod.rs # Core engine
│ │ ├── parser.rs # Query parsing
│ │ ├── optimizer.rs # Query optimization
│ │ ├── planner.rs # Execution planning
│ │ ├── executor.rs # Query execution
│ │ ├── indexes.rs # Metadata indexing
│ │ ├── scoring.rs # Hybrid scoring
│ │ └── stats.rs # Statistics collection
│ └── lib.rs
examples/
├── neuro_symbolic_queries/
│ ├── sql_examples.rs # SQL query examples
│ ├── cypher_examples.rs # Cypher query examples
│ ├── hybrid_scoring.rs # Hybrid scoring examples
│ └── README.md
```
### Dependencies on Other Features
**Depends On:**
- **HNSW Index**: Core vector search functionality
- **Existing Cypher Support**: Extend existing graph query support
**Synergies With:**
- **GNN-Guided Routing (Feature 1)**: Can use GNN for smarter query execution
- **Incremental Learning (Feature 2)**: Real-time index updates support streaming queries
**External Dependencies:**
- `sqlparser` - SQL parsing
- `cypher-parser` - Cypher parsing (if not already present)
- `roaring` - Roaring Bitmap for efficient set operations
- `serde` - Query serialization
## Regression Prevention
### What Existing Functionality Could Break
1. **Pure Vector Search Performance**
- Risk: Adding metadata lookups slows down simple vector queries
- Impact: Regression in baseline HNSW performance
2. **Memory Usage**
- Risk: Metadata indexes consume excessive RAM
- Impact: OOM on large datasets
3. **Query Correctness**
- Risk: Filter pushdown logic has bugs, returns wrong results
- Impact: Incorrect search results
4. **Cypher Compatibility**
- Risk: Extending Cypher syntax breaks existing queries
- Impact: Breaking change for existing users
### Test Cases to Prevent Regressions
```rust
// File: crates/ruvector-query/tests/neuro_symbolic_regression_tests.rs
use std::time::Instant;
#[test]
fn test_pure_vector_search_unchanged() {
// Simple vector queries should have zero overhead
    let engine = setup_test_engine();
    let query_vector = test_query_vector(); // shared by both code paths below
    // Baseline: pure HNSW search (no filters); $query is bound to query_vector
    let query_baseline = "SELECT * FROM vectors ORDER BY similarity(embedding, $query) DESC LIMIT 10";
let start = Instant::now();
let results = engine.execute_query(query_baseline).unwrap();
let time_with_engine = start.elapsed();
// Direct HNSW (without query engine)
let start = Instant::now();
let results_direct = engine.hnsw_index.search(&query_vector, 10).unwrap();
let time_direct = start.elapsed();
// Query engine should add <5% overhead
let overhead = (time_with_engine.as_secs_f64() / time_direct.as_secs_f64()) - 1.0;
assert!(overhead < 0.05, "Overhead: {:.2}%, expected <5%", overhead * 100.0);
// Results should be identical
assert_eq!(results.node_ids, results_direct.node_ids);
}
#[test]
fn test_filter_correctness() {
// Filtered queries must return correct subset
    let engine = setup_test_engine_with_metadata();
    let query_vector = test_query_vector(); // bound to $query in the query below
let query = "SELECT * FROM vectors
WHERE similarity(embedding, $query) > 0.8
AND category = 'research'
AND year >= 2023
LIMIT 10";
let results = engine.execute_query(query).unwrap();
// Verify each result matches ALL predicates
for node_id in &results.node_ids {
let similarity = compute_similarity(&query_vector, engine.get_vector(*node_id));
assert!(similarity > 0.8, "Node {} similarity: {}, expected >0.8", node_id, similarity);
let category = engine.get_metadata(*node_id, "category");
assert_eq!(category, "research", "Node {} category: {}, expected 'research'", node_id, category);
let year = engine.get_metadata(*node_id, "year").parse::<i32>().unwrap();
assert!(year >= 2023, "Node {} year: {}, expected >=2023", node_id, year);
}
}
#[test]
fn test_filter_pushdown_performance() {
// Pushdown filters should be much faster than post-filtering
    let engine = setup_test_engine_with_metadata();
    let query_vector = test_query_vector(); // bound to $query in both queries below
// With pushdown (optimized)
let query_pushdown = "SELECT * FROM vectors
WHERE similarity(embedding, $query) > 0.8
AND category = 'research'
LIMIT 10";
let start = Instant::now();
let results_pushdown = engine.execute_query(query_pushdown).unwrap();
let time_pushdown = start.elapsed();
    // Without pushdown (post-filter, manual implementation).
    // The over-fetching search is timed too: it is part of the
    // post-filtering cost being compared against.
    let start = Instant::now();
    let all_results = engine.hnsw_index.search(&query_vector, 10000).unwrap();
    let results_post: Vec<_> = all_results.into_iter()
        .filter(|r| r.similarity > 0.8)
        .filter(|r| engine.get_metadata(r.node_id, "category") == "research")
        .take(10)
        .collect();
    let time_post = start.elapsed();
    // Pushdown should be ≥5x faster
    let speedup = time_post.as_secs_f64() / time_pushdown.as_secs_f64();
    assert!(speedup >= 5.0, "Speedup: {:.1}x, expected ≥5x", speedup);
    // Results should be identical, not just equal in number
    let post_ids: Vec<_> = results_post.iter().map(|r| r.node_id).collect();
    assert_eq!(results_pushdown.node_ids, post_ids);
}
#[test]
fn test_hybrid_scoring_correctness() {
// Hybrid scores should combine neural and symbolic correctly
let engine = setup_test_engine();
engine.set_scoring_config(HybridScoringConfig {
neural_weight: 0.7,
symbolic_weight: 0.3,
normalization: NormalizationMethod::MinMax,
});
let query = "SELECT * FROM vectors
WHERE similarity(embedding, $query) > 0.5
AND year >= 2020
ORDER BY hybrid_score DESC
LIMIT 10";
let results = engine.execute_query(query).unwrap();
// Verify hybrid score formula
for i in 0..results.node_ids.len() {
let neural = results.neural_scores[i];
let symbolic = results.symbolic_scores.as_ref().unwrap()[i];
        // Min-max normalize. The bounds mirror the query: similarity is
        // constrained to [0.5, 1.0] by the predicate, and symbolic
        // satisfaction lives in [0.0, 1.0].
        let neural_norm = (neural - 0.5) / (1.0 - 0.5);
        let symbolic_norm = (symbolic - 0.0) / (1.0 - 0.0);
let expected_hybrid = 0.7 * neural_norm + 0.3 * symbolic_norm;
let actual_hybrid = results.hybrid_scores[i];
assert!((expected_hybrid - actual_hybrid).abs() < 1e-5,
"Hybrid score mismatch: expected {}, got {}", expected_hybrid, actual_hybrid);
}
}
#[test]
fn test_boolean_logic_correctness() {
// AND/OR/NOT operations must be correct
let engine = setup_test_engine();
// Test AND
let query_and = "SELECT * FROM vectors
WHERE category = 'A' AND tag = 'X'";
let results_and = engine.execute_query(query_and).unwrap();
for node_id in &results_and.node_ids {
assert_eq!(engine.get_metadata(*node_id, "category"), "A");
assert_eq!(engine.get_metadata(*node_id, "tag"), "X");
}
// Test OR
let query_or = "SELECT * FROM vectors
WHERE category = 'A' OR category = 'B'";
let results_or = engine.execute_query(query_or).unwrap();
for node_id in &results_or.node_ids {
let category = engine.get_metadata(*node_id, "category");
assert!(category == "A" || category == "B");
}
// Test NOT
let query_not = "SELECT * FROM vectors
WHERE category = 'A' AND NOT tag = 'X'";
let results_not = engine.execute_query(query_not).unwrap();
for node_id in &results_not.node_ids {
assert_eq!(engine.get_metadata(*node_id, "category"), "A");
assert_ne!(engine.get_metadata(*node_id, "tag"), "X");
}
}
```
### Backward Compatibility Strategy
1. **Opt-In Feature**
- Neuro-symbolic queries are opt-in (require explicit SQL/Cypher syntax)
- Existing vector search API unchanged
2. **Graceful Degradation**
   - If metadata indexes are not available, fall back to post-filtering
   - Log a warning, but do not crash
3. **Configuration**
```yaml
query:
neuro_symbolic:
enabled: true # Default: true
metadata_indexes: true # Default: true
hybrid_scoring: true # Default: true
```
4. **API Versioning**
- New endpoints for neuro-symbolic queries (`/query/sql`, `/query/cypher`)
- Existing endpoints (`/search`) unchanged
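The graceful-degradation rule above (item 2) amounts to a per-filter planning decision. A sketch with stand-in types (the `QueryEngine` and `Plan` names here are illustrative, not the real API):

```rust
/// Hypothetical sketch of the fallback policy: prefer index-accelerated
/// pushdown, and degrade to post-filtering when the needed metadata
/// index is missing. All type and field names are stand-ins.
struct QueryEngine {
    indexed_fields: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum Plan {
    Pushdown,   // predicate evaluated inside the HNSW traversal
    PostFilter, // over-fetch, then filter the results afterwards
}

impl QueryEngine {
    fn plan_filter(&self, field: &str) -> Plan {
        if self.indexed_fields.iter().any(|f| f == field) {
            Plan::Pushdown
        } else {
            // Spec: log a warning, but do not crash.
            eprintln!("warning: no index on '{field}', falling back to post-filtering");
            Plan::PostFilter
        }
    }
}

fn main() {
    let engine = QueryEngine { indexed_fields: vec!["category".into()] };
    assert_eq!(engine.plan_filter("category"), Plan::Pushdown);
    assert_eq!(engine.plan_filter("year"), Plan::PostFilter);
}
```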
## Implementation Phases
### Phase 1: Core Infrastructure (Week 1-2)
**Goal**: Query parsing and basic execution
**Tasks**:
1. Implement SQL/Cypher parser
2. Build AST representation
3. Implement basic query executor (no optimization)
4. Unit tests for parsing and execution
**Deliverables**:
- `neuro_symbolic/parser.rs`
- `neuro_symbolic/executor.rs`
- Passing unit tests
**Success Criteria**:
- Can parse and execute simple queries (vector similarity only)
- Correct results (matches HNSW baseline)
### Phase 2: Metadata Indexing (Week 2-3)
**Goal**: Support symbolic predicates
**Tasks**:
1. Implement inverted index for categorical fields
2. Implement B-tree index for range queries
3. Integrate Roaring Bitmap for set operations
4. Test index correctness and performance
**Deliverables**:
- `neuro_symbolic/indexes.rs`
- Index creation and query APIs
- Benchmark report
**Success Criteria**:
- Indexes correctly return matching nodes
- Index queries <10ms for typical workloads
- Memory overhead <20% of vector data
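The two index shapes in the Phase 2 task list can be sketched in a few lines (std-only; the plan substitutes roaring bitmaps for the `Vec<u32>` posting lists, and all names here are illustrative):

```rust
use std::collections::{BTreeMap, HashMap};

/// Minimal sketch of the two Phase-2 index shapes. The real engine
/// would store roaring bitmaps rather than Vec<u32> posting lists.
struct MetadataIndexes {
    /// Inverted index: field value -> sorted node ids (e.g. category = 'research')
    inverted: HashMap<String, Vec<u32>>,
    /// Range index: numeric key -> node ids (e.g. year >= 2023)
    range: BTreeMap<i64, Vec<u32>>,
}

impl MetadataIndexes {
    fn eq_lookup(&self, value: &str) -> Vec<u32> {
        self.inverted.get(value).cloned().unwrap_or_default()
    }

    /// All node ids whose key is >= lo (inclusive).
    fn ge_lookup(&self, lo: i64) -> Vec<u32> {
        let mut ids: Vec<u32> = self
            .range
            .range(lo..) // B-tree gives ordered range scans for free
            .flat_map(|(_, v)| v.iter().copied())
            .collect();
        ids.sort_unstable();
        ids
    }
}

fn main() {
    let mut inverted = HashMap::new();
    inverted.insert("research".to_string(), vec![1, 4, 7]);
    let mut range = BTreeMap::new();
    range.insert(2022, vec![1, 2]);
    range.insert(2023, vec![4]);
    range.insert(2024, vec![7, 9]);
    let idx = MetadataIndexes { inverted, range };
    assert_eq!(idx.eq_lookup("research"), vec![1, 4, 7]);
    assert_eq!(idx.ge_lookup(2023), vec![4, 7, 9]); // year >= 2023
}
```

Equality predicates hit the inverted index, range predicates hit the B-tree; both return posting lists that feed directly into the set-operation machinery used for AND/OR/NOT.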
### Phase 3: Filter Pushdown (Week 3-4)
**Goal**: Optimize query execution
**Tasks**:
1. Implement filter pushdown into HNSW search
2. Modify HNSW to support filter callbacks
3. Benchmark speedup vs post-filtering
4. Test correctness of pushdown logic
**Deliverables**:
- Modified `hnsw.rs` with filter support
- `neuro_symbolic/optimizer.rs`
- Performance benchmarks
**Success Criteria**:
- ≥5x speedup for filtered queries
- Zero correctness regressions
- Works with complex boolean logic (AND/OR/NOT)
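The filter-callback contract can be illustrated without a real HNSW graph: the search accepts a predicate and rejects candidates during traversal rather than after it. A brute-force stand-in for the graph walk (all names hypothetical; the production signature lives in the modified `hnsw.rs`):

```rust
/// Sketch of the Phase-3 filter callback: the search takes a predicate
/// and skips non-matching candidates *during* the scan, instead of
/// discarding them afterwards. A brute-force scan stands in for the
/// HNSW traversal here.
fn search_filtered<F>(
    vectors: &[(u32, Vec<f32>)], // (node id, embedding)
    query: &[f32],
    k: usize,
    filter: F, // e.g. "category = 'research'" compiled to a closure
) -> Vec<(u32, f32)>
where
    F: Fn(u32) -> bool,
{
    let mut hits: Vec<(u32, f32)> = vectors
        .iter()
        .filter(|(id, _)| filter(*id)) // pushdown: predicate checked first
        .map(|(id, v)| (*id, cosine(query, v)))
        .collect();
    hits.sort_by(|a, b| b.1.total_cmp(&a.1)); // best similarity first
    hits.truncate(k);
    hits
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn main() {
    let vectors = vec![
        (0, vec![1.0, 0.0]),
        (1, vec![0.9, 0.1]),
        (2, vec![0.0, 1.0]),
    ];
    // Only even ids pass the (stand-in) metadata predicate.
    let top = search_filtered(&vectors, &[1.0, 0.0], 2, |id| id % 2 == 0);
    let ids: Vec<u32> = top.iter().map(|(id, _)| *id).collect();
    assert_eq!(ids, vec![0, 2]);
}
```

In the real index the predicate is checked before a node's distance is computed, so filtered-out regions of the graph cost almost nothing; that is where the targeted ≥5x speedup comes from.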
### Phase 4: Hybrid Scoring (Week 4-5)
**Goal**: Combine neural and symbolic scores
**Tasks**:
1. Implement hybrid scoring algorithm
2. Add score normalization methods
3. Tune weights (α, β) for best results
4. Test on real-world datasets
**Deliverables**:
- `neuro_symbolic/scoring.rs`
- Hybrid scoring benchmarks
- Configuration guide
**Success Criteria**:
- Hybrid queries improve relevance metrics (NDCG, MRR)
- Configurable weights work as expected
- Performance <20ms for typical queries
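The hybrid score itself is small: min-max-normalize the neural and symbolic score vectors, then take a weighted blend. A sketch using the same 0.7/0.3 weights as the regression test above (illustrative defaults, not tuned values):

```rust
/// Min-max normalization over one batch of scores, so neural and
/// symbolic components land on a common [0, 1] scale before blending.
fn min_max(xs: &[f32]) -> Vec<f32> {
    let lo = xs.iter().copied().fold(f32::INFINITY, f32::min);
    let hi = xs.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    if (hi - lo).abs() < f32::EPSILON {
        return vec![1.0; xs.len()]; // all scores equal: treat as fully satisfied
    }
    xs.iter().map(|x| (x - lo) / (hi - lo)).collect()
}

/// hybrid = alpha * normalized_neural + beta * normalized_symbolic
fn hybrid_scores(neural: &[f32], symbolic: &[f32], alpha: f32, beta: f32) -> Vec<f32> {
    let n = min_max(neural);
    let s = min_max(symbolic);
    n.iter().zip(&s).map(|(a, b)| alpha * a + beta * b).collect()
}

fn main() {
    let neural = [0.5, 0.75, 1.0];  // cosine similarities
    let symbolic = [1.0, 0.0, 0.5]; // fraction of soft predicates satisfied
    let h = hybrid_scores(&neural, &symbolic, 0.7, 0.3);
    // First item: 0.7 * 0.0 + 0.3 * 1.0 = 0.3
    assert!((h[0] - 0.3).abs() < 1e-6);
    // Last item: 0.7 * 1.0 + 0.3 * 0.5 = 0.85
    assert!((h[2] - 0.85).abs() < 1e-6);
}
```

Note that min-max normalization is batch-relative: the same raw similarity can normalize differently across result sets, which is one reason weight tuning (Risk 4) needs real query logs.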
### Phase 5: Production Hardening (Week 5-6)
**Goal**: Production-ready feature
**Tasks**:
1. Add comprehensive error handling
2. Write documentation and examples
3. Stress testing (large datasets, complex queries)
4. Integration with existing Cypher support
**Deliverables**:
- Full error handling
- User documentation
- Example queries
- Regression test suite
**Success Criteria**:
- Zero crashes in stress tests
- Documentation complete
- Ready for alpha release
## Success Metrics
### Performance Benchmarks
**Primary Metrics** (Must Achieve):
| Query Type | Baseline (Post-Filter) | Neuro-Symbolic | Target Improvement |
|------------|------------------------|----------------|--------------------|
| Similarity + 1 filter | 50ms | 5ms | **10x faster** |
| Similarity + 3 filters | 200ms | 8ms | **25x faster** |
| Complex boolean (AND/OR/NOT) | N/A (manual) | 15ms | **New capability** |
| Multi-modal (vector + graph) | 500ms (manual joins) | 20ms | **25x faster** |
**Secondary Metrics**:
| Metric | Target |
|--------|--------|
| Index memory overhead | <20% of vector data |
| Query parsing time | <1ms |
| Hybrid scoring overhead | <2ms |
| Concurrent query throughput | Same as baseline |
### Accuracy Metrics
**Relevance Improvement** (on benchmark datasets):
- NDCG@10: +15% (hybrid scoring vs pure vector)
- MRR (Mean Reciprocal Rank): +20%
- Precision@10: +10%
**Correctness**:
- 100% of filtered results match all predicates
- Zero false positives or false negatives
### Memory/Latency Targets
**Memory**:
- Inverted indexes: <100MB per 1M nodes (categorical fields)
- B-tree indexes: <50MB per 1M nodes (range fields)
- Total overhead: <20% of vector index size
**Latency**:
- Simple query (1 filter): <10ms
- Complex query (3+ filters): <20ms
- Hybrid scoring: <5ms overhead
- P99 latency: <50ms
**Throughput**:
- Concurrent queries: Same as baseline HNSW
- No lock contention on indexes
## Risks and Mitigations
### Technical Risks
**Risk 1: Query Parser Complexity**
*Probability: Medium | Impact: Medium*
**Description**: SQL/Cypher parsing is complex, could have bugs or performance issues.
**Mitigation**:
- Use established parsing libraries (`sqlparser`, `cypher-parser`)
- Extensive test suite with edge cases
- Validate AST before execution
- Provide query validation tool
**Contingency**: Start with simple query subset, expand incrementally.
---
**Risk 2: Index Memory Overhead**
*Probability: High | Impact: Medium*
**Description**: Metadata indexes could consume excessive memory on large datasets.
**Mitigation**:
- Use compressed indexes (Roaring Bitmap for sparse sets)
- Make indexing optional (user chooses which fields to index)
- Monitor memory usage in tests
- Provide index size estimation tool
**Contingency**: Support external indexes (e.g., SQLite) for low-memory environments.
---
**Risk 3: Filter Pushdown Bugs**
*Probability: Medium | Impact: Critical*
**Description**: Incorrect filter logic could return wrong results.
**Mitigation**:
- Extensive correctness testing (ground truth validation)
- Compare pushdown results vs post-filtering
- Add assertion checks in debug builds
- Fuzzing for edge cases
**Contingency**: Add "safe mode" that validates results against post-filtering.
---
**Risk 4: Hybrid Scoring Tuning Difficulty**
*Probability: High | Impact: Low*
**Description**: Users may struggle to tune α/β weights for hybrid scoring.
**Mitigation**:
- Provide automatic weight tuning (based on query logs)
- Document recommended defaults for common use cases
- Add visualization tools for score distributions
- Support A/B testing framework
**Contingency**: Default to pure neural scoring (α=1, β=0) if the user is unsure.
---
**Risk 5: Cypher Integration Conflicts**
*Probability: Low | Impact: Medium*
**Description**: Extending Cypher syntax could conflict with existing graph queries.
**Mitigation**:
- Careful syntax design (use reserved keywords)
- Version Cypher extensions separately
- Extensive compatibility testing
- Document syntax differences
**Contingency**: Use separate query language (e.g., extended SQL only) if conflicts arise.
---
### Summary Risk Matrix
| Risk | Probability | Impact | Mitigation Priority |
|------|-------------|--------|---------------------|
| Query parser complexity | Medium | Medium | Medium |
| Index memory overhead | High | Medium | **HIGH** |
| Filter pushdown bugs | Medium | Critical | **CRITICAL** |
| Hybrid scoring tuning | High | Low | LOW |
| Cypher integration conflicts | Low | Medium | Medium |
---
## Next Steps
1. **Prototype Phase 1**: Build SQL parser and basic executor (1 week)
2. **Validate Queries**: Test on simple queries, measure correctness (2 days)
3. **Add Metadata Indexes**: Implement inverted + B-tree indexes (1 week)
4. **Benchmark Performance**: Measure speedup vs post-filtering (3 days)
5. **Iterate**: Optimize based on profiling (ongoing)
**Key Decision Points**:
- After Phase 1: Is query parsing fast enough? (<1ms target)
- After Phase 3: Does filter pushdown work correctly? (Zero regressions)
- After Phase 4: Does hybrid scoring improve relevance? (+10% NDCG required)
**Go/No-Go Criteria**:
- ✅ 5x+ speedup on filtered queries
- ✅ Zero correctness regressions
- ✅ Memory overhead <20%
- ✅ Improved relevance metrics