wifi-densepose/vendor/ruvector/docs/research/gnn-v2/03-neuro-symbolic-query.md


Neuro-Symbolic Query Execution - Implementation Plan

Overview

Problem Statement

Current vector search in ruvector is purely neural (similarity-based): given a query vector, find the k most similar vectors by cosine/Euclidean distance. However, real-world queries often involve logical constraints that pure vector similarity cannot express:

Examples of Unsupported Queries:

  • "Find vectors similar to X AND published after 2023 AND tagged as 'research'"
  • "Find vectors similar to X OR similar to Y, EXCLUDING category 'spam'"
  • "Find vectors where metadata.price < 100 AND similarity > 0.8"
  • "Find vectors in graph community C AND within 2 hops of node N"

Current Limitations:

  • No support for boolean logic (AND, OR, NOT)
  • Cannot filter by metadata attributes
  • Cannot combine vector similarity with graph structure
  • Forces post-processing filtering (inefficient)
  • No way to express complex multi-modal queries

Performance Impact:

  • Retrieving 10,000 vectors then filtering to 10 wastes 99.9% of computation
  • No index acceleration for metadata predicates
  • Cannot push down filters to HNSW search
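The retrieve-then-filter overhead is easy to quantify. A minimal Python sketch, using the same illustrative numbers as above (real selectivities vary per query):

```python
def post_filter_waste(retrieved: int, kept: int) -> float:
    """Fraction of similarity computations wasted by retrieve-then-filter."""
    return (retrieved - kept) / retrieved

# Retrieving 10,000 candidates and keeping 10 wastes 99.9% of the work
waste = post_filter_waste(10_000, 10)
print(f"{waste:.1%}")  # prints 99.9%
```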

Proposed Solution

Neuro-Symbolic Query Execution: A hybrid query engine that combines neural vector similarity with symbolic logical constraints.

Key Components:

  1. Query Language: Extend existing Cypher/SQL support with vector similarity operators
  2. Hybrid Scoring: Combine vector similarity scores with predicate satisfaction
  3. Filter Pushdown: Apply logical constraints during HNSW search (not after)
  4. Multi-Modal Indexing: Index metadata attributes alongside vectors
  5. Constraint Propagation: Use graph structure to prune search space

Architecture:

Query: "MATCH (v:Vector) WHERE vector_similarity(v.embedding, $query) > 0.8
        AND v.year >= 2023 AND v.category IN ['research', 'papers']
        RETURN v ORDER BY similarity DESC LIMIT 10"

      ↓ Parse & Optimize

Neural Component:        Symbolic Component:
vector_similarity > 0.8  year >= 2023 AND category IN [...]
      ↓                        ↓
  HNSW Search            Metadata Index
      ↓                        ↓
      └──────── Merge ─────────┘
               ↓
        Hybrid Scoring (α * neural + β * symbolic)
               ↓
        Top-K Results
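The hybrid scoring line in the diagram can be made concrete. A Python sketch using min-max normalization; the weights (0.7/0.3) and the score values are illustrative, not tuned:

```python
def min_max(scores):
    """Normalize a score list into [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]  # degenerate case: all scores equal
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(neural, symbolic, alpha=0.7, beta=0.3):
    """score = alpha * normalized_neural + beta * normalized_symbolic"""
    return [alpha * n + beta * s
            for n, s in zip(min_max(neural), min_max(symbolic))]

# A result with slightly lower raw similarity can win on the hybrid score
# once its symbolic score (e.g. recency) is folded in.
scores = hybrid_scores([0.95, 0.94, 0.50], [0.0, 1.0, 0.2])
```

Here the second result overtakes the first despite a marginally lower raw similarity, which is exactly the multi-modal fusion the pipeline is meant to enable.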

Expected Benefits

Quantified Performance Improvements:

Query Type              Current (Post-Filter)   Neuro-Symbolic        Improvement
Similarity + 1 filter   50ms (10K retrieved)    5ms (100 retrieved)   10x faster
Similarity + 3 filters  200ms (50K retrieved)   8ms (200 retrieved)   25x faster
Complex boolean logic   Not supported           15ms                  New capability
Multi-modal query       Manual joins            20ms                  50x faster

Qualitative Benefits:

  • Express complex queries naturally (no manual post-processing)
  • Efficient execution with filter pushdown
  • Support for real-world use cases (e-commerce, research, RAG)
  • Better accuracy through multi-modal fusion
  • Graph-aware queries (community detection, path constraints)

Technical Design

Architecture Diagram (ASCII Art)

┌─────────────────────────────────────────────────────────────────┐
│              Neuro-Symbolic Query Execution Pipeline             │
└─────────────────────────────────────────────────────────────────┘

User Query (SQL/Cypher + Vector Similarity)
     │
     │  Example: "SELECT * FROM vectors
     │             WHERE cosine_similarity(embedding, $query) > 0.8
     │             AND category = 'research' AND year >= 2023
     │             ORDER BY similarity DESC LIMIT 10"
     │
     ▼
┌─────────────────────────────────────────────────────────────────┐
│  Query Parser & AST Builder                                      │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  Parse query into Abstract Syntax Tree (AST)               │ │
│  │  ┌──────────────────────────────────────────────────────┐  │ │
│  │  │ SELECT                                               │  │ │
│  │  │   WHERE                                              │  │ │
│  │  │     AND                                              │  │ │
│  │  │       ├─ cosine_similarity(emb, $q) > 0.8 [NEURAL]  │  │ │
│  │  │       ├─ category = 'research'        [SYMBOLIC]    │  │ │
│  │  │       └─ year >= 2023                 [SYMBOLIC]    │  │ │
│  │  │   ORDER BY similarity DESC                           │  │ │
│  │  │   LIMIT 10                                           │  │ │
│  │  └──────────────────────────────────────────────────────┘  │ │
│  └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────────────────────────────┐
│  Query Optimizer                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  Analyze predicates and rewrite query for efficiency      │ │
│  │                                                             │ │
│  │  1. Predicate Pushdown:                                    │ │
│  │     Move filters into HNSW search (before candidate gen)   │ │
│  │                                                             │ │
│  │  2. Index Selection:                                       │ │
│  │     Choose best index for symbolic predicates              │ │
│  │     - category: inverted index                             │ │
│  │     - year: range index (B-tree)                           │ │
│  │                                                             │ │
│  │  3. Execution Strategy:                                    │ │
│  │     - If few categories: scan category index first         │ │
│  │     - If similarity selective: HNSW first, then filter     │ │
│  │     - If balanced: hybrid merge                            │ │
│  │                                                             │ │
│  │  4. Hybrid Scoring:                                        │ │
│  │     score = α * neural_sim + β * symbolic_score            │ │
│  └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────────────────────────────┐
│  Execution Plan                                                  │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  Step 1: HNSW Search (neural)                              │ │
│  │    - Target: similarity > 0.8                              │ │
│  │    - Candidate pool: ef=200                                │ │
│  │    - Early termination: collect ~100 candidates            │ │
│  │    - Filter during search: year >= 2023                    │ │
│  │    Output: {node_id, similarity} for ~100 candidates       │ │
│  │                                                             │ │
│  │  Step 2: Symbolic Filtering (metadata index)               │ │
│  │    - Lookup category index: category = 'research'          │ │
│  │    - Intersect with HNSW candidates                        │ │
│  │    Output: {node_id, similarity, metadata} for ~30 nodes   │ │
│  │                                                             │ │
│  │  Step 3: Hybrid Scoring                                    │ │
│  │    - Compute symbolic_score (e.g., recency bonus)          │ │
│  │    - Combined: 0.7 * similarity + 0.3 * symbolic_score     │ │
│  │    Output: {node_id, hybrid_score}                         │ │
│  │                                                             │ │
│  │  Step 4: Top-K Selection                                   │ │
│  │    - Sort by hybrid_score DESC                             │ │
│  │    - Return top 10                                         │ │
│  └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────────────────────────────┐
│  Result Set                                                      │
│  [{id: 42, similarity: 0.95, category: 'research', year: 2024}, │
│   {id: 137, similarity: 0.92, category: 'research', year: 2023},│
│   ...]                                                           │
└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐
│              Indexing & Storage Architecture                     │
└─────────────────────────────────────────────────────────────────┘

Vector Data:
┌─────────────────────────────────────────────────────────────────┐
│  HNSW Index (vector similarity)                                  │
│  - Node ID → Embedding vector                                   │
│  - Graph structure for approximate NN search                    │
└─────────────────────────────────────────────────────────────────┘

Metadata Data:
┌─────────────────────────────────────────────────────────────────┐
│  Inverted Index (categorical attributes)                        │
│  - category → {node_ids}                                        │
│  - tag → {node_ids}                                             │
│  - author → {node_ids}                                          │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│  B-Tree Index (range attributes)                                │
│  - year → sorted {node_ids}                                     │
│  - price → sorted {node_ids}                                    │
│  - timestamp → sorted {node_ids}                                │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│  Roaring Bitmap Index (set operations)                          │
│  - Efficient AND/OR/NOT on node ID sets                         │
│  - Compressed storage for sparse sets                           │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│  Graph Index (structural constraints)                           │
│  - Community membership: community_id → {node_ids}              │
│  - k-hop neighborhoods: precomputed for common queries          │
│  - Path constraints: shortest path caches                       │
└─────────────────────────────────────────────────────────────────┘
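The bitmap layer reduces boolean predicates to set algebra over node-ID sets. The sketch below uses Python's built-in sets standing in for compressed Roaring bitmaps; the ID sets are illustrative:

```python
# Posting lists from the metadata indexes (illustrative node IDs)
research = {1, 2, 5, 8, 13}   # category = 'research'
recent   = {2, 3, 5, 8, 21}   # year >= 2023
spam     = {5, 34}            # category = 'spam'

# AND -> intersection, OR -> union, NOT -> difference
and_ids = research & recent            # research AND recent
or_ids  = research | recent            # research OR recent
result  = (research & recent) - spam   # (research AND recent) NOT spam
```

A real implementation would perform the same operations on RoaringBitmap values, which stay compressed and intersect in time proportional to the smaller set.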

Core Data Structures (Rust)

// File: crates/ruvector-query/src/neuro_symbolic/mod.rs

use std::collections::{HashMap, HashSet};
use serde::{Deserialize, Serialize};

/// Neuro-symbolic query execution engine
pub struct NeuroSymbolicEngine {
    /// HNSW index for vector similarity
    hnsw_index: Arc<HnswIndex>,

    /// Metadata indexes (inverted, B-tree, etc.)
    metadata_indexes: MetadataIndexes,

    /// Query optimizer
    optimizer: QueryOptimizer,

    /// Execution planner
    planner: ExecutionPlanner,

    /// Hybrid scoring configuration
    scoring_config: HybridScoringConfig,
}

/// Query representation (SQL/Cypher AST)
#[derive(Debug, Clone)]
pub struct Query {
    /// SELECT clause (which fields to return)
    pub select: Vec<String>,

    /// WHERE clause (predicates)
    pub where_clause: Option<Predicate>,

    /// ORDER BY clause
    pub order_by: Vec<OrderBy>,

    /// LIMIT clause
    pub limit: Option<usize>,

    /// OFFSET clause
    pub offset: Option<usize>,
}

/// Predicate tree (boolean logic)
#[derive(Debug, Clone)]
pub enum Predicate {
    /// Neural predicate: vector similarity
    VectorSimilarity {
        field: String,
        query_vector: Vec<f32>,
        operator: ComparisonOp,  // >, <, =
        threshold: f32,
        metric: SimilarityMetric,  // cosine, euclidean, dot
    },

    /// Symbolic predicate: metadata constraint
    Attribute {
        field: String,
        operator: ComparisonOp,
        value: Value,
    },

    /// Graph predicate: structural constraint
    Graph {
        constraint: GraphConstraint,
    },

    /// Boolean operators
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
    Not(Box<Predicate>),
}

#[derive(Debug, Clone)]
pub enum GraphConstraint {
    /// Node in community
    InCommunity { community_id: u32 },

    /// Within k hops of node
    WithinKHops { source_node: u32, k: usize },

    /// On path between two nodes
    OnPath { source: u32, target: u32 },

    /// Has edge to node
    ConnectedTo { node_id: u32 },
}

#[derive(Debug, Clone, Copy)]
pub enum ComparisonOp {
    Eq,    // =
    Ne,    // !=
    Lt,    // <
    Le,    // <=
    Gt,    // >
    Ge,    // >=
    In,    // IN (...)
    Like,  // LIKE (string pattern)
}

#[derive(Debug, Clone)]
pub enum Value {
    Int(i64),
    Float(f64),
    String(String),
    Bool(bool),
    List(Vec<Value>),
}

#[derive(Debug, Clone, Copy)]
pub enum SimilarityMetric {
    Cosine,
    Euclidean,
    DotProduct,
    L1,
}

/// Metadata indexing structures
pub struct MetadataIndexes {
    /// Inverted indexes for categorical fields
    inverted: HashMap<String, InvertedIndex>,

    /// B-tree indexes for range queries
    btree: HashMap<String, BTreeIndex>,

    /// Roaring bitmap for set operations
    bitmap_store: BitmapStore,

    /// Graph structural indexes
    graph_index: GraphStructureIndex,
}

/// Inverted index: field_value → {node_ids}
pub struct InvertedIndex {
    /// Map from value to posting list (node IDs)
    postings: HashMap<String, RoaringBitmap>,

    /// Statistics for query optimization
    stats: IndexStats,
}

/// B-tree index for range queries
pub struct BTreeIndex {
    /// Sorted map from value to node IDs
    tree: BTreeMap<OrderedValue, RoaringBitmap>,

    /// Statistics
    stats: IndexStats,
}

/// Roaring bitmap store for efficient set operations
pub struct BitmapStore {
    /// Node ID sets as compressed bitmaps
    bitmaps: HashMap<String, RoaringBitmap>,
}

/// Graph structure indexes
pub struct GraphStructureIndex {
    /// Community assignments
    communities: HashMap<u32, RoaringBitmap>,

    /// k-hop neighborhoods (precomputed)
    khop_cache: HashMap<(u32, usize), RoaringBitmap>,

    /// Shortest path cache
    path_cache: PathCache,
}

#[derive(Debug, Default)]
pub struct IndexStats {
    pub num_unique_values: usize,
    pub total_postings: usize,
    pub avg_posting_length: f64,
    pub selectivity: f64,  // fraction of nodes matching
}

/// Query execution plan
#[derive(Debug)]
pub struct ExecutionPlan {
    /// Ordered steps to execute
    pub steps: Vec<ExecutionStep>,

    /// Estimated cost
    pub estimated_cost: f64,

    /// Estimated result size
    pub estimated_results: usize,
}

#[derive(Debug)]
pub enum ExecutionStep {
    /// HNSW vector search
    VectorSearch {
        query_vector: Vec<f32>,
        similarity_threshold: f32,
        metric: SimilarityMetric,
        ef: usize,
        filters: Vec<InlineFilter>,  // Filters applied during search
    },

    /// Metadata index lookup
    IndexScan {
        index_name: String,
        predicate: Predicate,
    },

    /// Graph structure traversal
    GraphTraversal {
        constraint: GraphConstraint,
    },

    /// Set intersection (AND)
    Intersect {
        left: Box<ExecutionStep>,
        right: Box<ExecutionStep>,
    },

    /// Set union (OR)
    Union {
        left: Box<ExecutionStep>,
        right: Box<ExecutionStep>,
    },

    /// Set difference (NOT)
    Difference {
        left: Box<ExecutionStep>,
        right: Box<ExecutionStep>,
    },

    /// Hybrid scoring
    HybridScore {
        neural_scores: HashMap<u32, f32>,
        symbolic_scores: HashMap<u32, f32>,
        alpha: f32,  // neural weight
        beta: f32,   // symbolic weight
    },

    /// Top-K selection
    TopK {
        input: Box<ExecutionStep>,
        k: usize,
        order_by: Vec<OrderBy>,
    },
}

/// Filter applied during HNSW search (pushdown)
#[derive(Debug, Clone)]
pub struct InlineFilter {
    pub field: String,
    pub operator: ComparisonOp,
    pub value: Value,
}

/// Hybrid scoring configuration
#[derive(Debug, Clone)]
pub struct HybridScoringConfig {
    /// Weight for neural similarity score
    pub neural_weight: f32,

    /// Weight for symbolic score
    pub symbolic_weight: f32,

    /// Normalization method
    pub normalization: NormalizationMethod,
}

#[derive(Debug, Clone, Copy)]
pub enum NormalizationMethod {
    /// Min-max normalization [0, 1]
    MinMax,

    /// Z-score normalization
    ZScore,

    /// None (assume scores already normalized)
    None,
}

/// Query result
#[derive(Debug, Serialize, Deserialize)]
pub struct QueryResult {
    /// Matched node IDs
    pub node_ids: Vec<u32>,

    /// Neural similarity scores
    pub neural_scores: Vec<f32>,

    /// Symbolic scores (if applicable)
    pub symbolic_scores: Option<Vec<f32>>,

    /// Hybrid scores
    pub hybrid_scores: Vec<f32>,

    /// Metadata for each result
    pub metadata: Vec<HashMap<String, Value>>,

    /// Query execution statistics
    pub stats: QueryStats,
}

#[derive(Debug, Serialize, Deserialize, Default)]
pub struct QueryStats {
    /// Total execution time (milliseconds)
    pub total_time_ms: f64,

    /// Time breakdown by step
    pub step_times: Vec<(String, f64)>,

    /// Number of candidates evaluated
    pub candidates_evaluated: usize,

    /// Number of results returned
    pub results_returned: usize,

    /// Index usage
    pub indexes_used: Vec<String>,
}

#[derive(Debug, Clone)]
pub struct OrderBy {
    pub field: String,
    pub direction: SortDirection,
}

#[derive(Debug, Clone, Copy)]
pub enum SortDirection {
    Asc,
    Desc,
}

/// Wrapper for ordered values in B-tree
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum OrderedValue {
    Int(i64),
    Float(OrderedFloat<f64>),
    String(String),
}

use ordered_float::OrderedFloat;
use roaring::RoaringBitmap;
use std::collections::BTreeMap;
use std::sync::Arc;
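To make the Predicate tree concrete before the pseudocode section, here is a small Python interpreter for the symbolic and boolean variants (field names and sample records are illustrative; the Rust implementation would pattern-match on the enum instead):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Attr:
    field: str
    op: str          # one of: '=', '!=', '<', '<=', '>', '>=', 'in'
    value: Any

@dataclass
class And:
    left: Any
    right: Any

@dataclass
class Or:
    left: Any
    right: Any

@dataclass
class Not:
    inner: Any

OPS = {
    '=':  lambda a, b: a == b,
    '!=': lambda a, b: a != b,
    '<':  lambda a, b: a < b,
    '<=': lambda a, b: a <= b,
    '>':  lambda a, b: a > b,
    '>=': lambda a, b: a >= b,
    'in': lambda a, b: a in b,
}

def evaluate(pred, record: dict) -> bool:
    """Evaluate a predicate tree against one node's metadata record."""
    if isinstance(pred, Attr):
        return OPS[pred.op](record[pred.field], pred.value)
    if isinstance(pred, And):
        return evaluate(pred.left, record) and evaluate(pred.right, record)
    if isinstance(pred, Or):
        return evaluate(pred.left, record) or evaluate(pred.right, record)
    if isinstance(pred, Not):
        return not evaluate(pred.inner, record)
    raise TypeError(f"unknown predicate: {pred!r}")

# year >= 2023 AND category IN ['research', 'papers']
pred = And(Attr('year', '>=', 2023),
           Attr('category', 'in', ['research', 'papers']))
evaluate(pred, {'year': 2024, 'category': 'research'})  # -> True
evaluate(pred, {'year': 2022, 'category': 'research'})  # -> False
```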

Key Algorithms (Pseudocode)

1. Query Execution Algorithm

function execute_neuro_symbolic_query(query: Query, engine: NeuroSymbolicEngine) -> QueryResult:
    """
    Execute neuro-symbolic query with hybrid scoring.

    Main algorithm: parse → optimize → plan → execute → score → return
    """
    start_time = now()

    # Step 1: Parse query into AST (already done, query is AST)
    # Step 2: Optimize query (predicate pushdown, index selection)
    optimized_query = engine.optimizer.optimize(query)

    # Step 3: Generate execution plan
    plan = engine.planner.create_plan(optimized_query)

    # Step 4: Execute plan steps
    result_set = execute_plan(plan, engine)

    # Step 5: Hybrid scoring
    if has_both_neural_and_symbolic(plan):
        result_set = apply_hybrid_scoring(
            result_set,
            engine.scoring_config
        )

    # Step 6: Apply ORDER BY and LIMIT
    result_set = sort_and_limit(
        result_set,
        query.order_by,
        query.limit,
        query.offset
    )

    # Step 7: Fetch metadata for results
    metadata = fetch_metadata(result_set.node_ids, query.select)

    execution_time = now() - start_time

    return QueryResult(
        node_ids=result_set.node_ids,
        neural_scores=result_set.neural_scores,
        symbolic_scores=result_set.symbolic_scores,
        hybrid_scores=result_set.hybrid_scores,
        metadata=metadata,
        stats=QueryStats(
            total_time_ms=execution_time,
            candidates_evaluated=result_set.candidates_evaluated,
            results_returned=len(result_set.node_ids),
            indexes_used=plan.indexes_used
        )
    )


function execute_plan(plan: ExecutionPlan, engine: NeuroSymbolicEngine) -> IntermediateResult:
    """
    Recursively execute plan steps.
    """
    results = None

    for step in plan.steps:
        match step:
            case VectorSearch:
                # HNSW search with optional filters
                results = execute_vector_search(step, engine.hnsw_index)

            case IndexScan:
                # Lookup in metadata index
                results = execute_index_scan(step, engine.metadata_indexes)

            case GraphTraversal:
                # Graph structure query
                results = execute_graph_traversal(step, engine.metadata_indexes.graph_index)

            case Intersect:
                # AND: set intersection
                left = execute_plan_step(step.left, engine)
                right = execute_plan_step(step.right, engine)
                results = intersect_results(left, right)

            case Union:
                # OR: set union
                left = execute_plan_step(step.left, engine)
                right = execute_plan_step(step.right, engine)
                results = union_results(left, right)

            case Difference:
                # NOT: set difference
                left = execute_plan_step(step.left, engine)
                right = execute_plan_step(step.right, engine)
                results = difference_results(left, right)

            case HybridScore:
                # Compute hybrid scores
                results = compute_hybrid_scores(
                    step.neural_scores,
                    step.symbolic_scores,
                    step.alpha,
                    step.beta
                )

            case TopK:
                # Select top-k results
                input_results = execute_plan_step(step.input, engine)
                results = select_top_k(input_results, step.k, step.order_by)

    return results


function execute_vector_search(step: VectorSearch, hnsw: HnswIndex) -> IntermediateResult:
    """
    HNSW search with filter pushdown.

    Key optimization: Apply symbolic filters during HNSW traversal
    to avoid generating candidates that will be filtered out anyway.
    """
    query_vector = step.query_vector
    similarity_threshold = step.similarity_threshold
    ef = step.ef
    inline_filters = step.filters

    # HNSW search with inline filtering
    candidates = []
    visited = set()

    # Start from entry point
    current_node = hnsw.entry_point
    layer = hnsw.max_layer

    while layer >= 0:
        # Greedy search at this layer
        while True:
            neighbors = hnsw.get_neighbors(current_node, layer)
            best_neighbor = None
            best_distance = float('inf')

            for neighbor in neighbors:
                if neighbor in visited:
                    continue
                visited.add(neighbor)

                # Distance is still computed for every unvisited neighbor so the
                # greedy traversal can move THROUGH filtered-out nodes; skipping
                # them entirely can disconnect the search graph and hurt recall.
                distance = compute_distance(query_vector, hnsw.get_vector(neighbor))
                similarity = distance_to_similarity(distance, step.metric)

                # Inline filters gate candidate COLLECTION, not traversal
                if (similarity >= similarity_threshold
                        and passes_inline_filters(neighbor, inline_filters, hnsw.metadata)):
                    candidates.append((neighbor, similarity))

                if distance < best_distance:
                    best_distance = distance
                    best_neighbor = neighbor

            if best_neighbor is None:
                break  # No improvement

            current_node = best_neighbor

        layer -= 1

    # Sort candidates by similarity
    candidates.sort(key=lambda x: x[1], reverse=True)

    return IntermediateResult(
        node_ids=[node_id for node_id, _ in candidates],
        neural_scores=[score for _, score in candidates],
        candidates_evaluated=len(visited)
    )


function passes_inline_filters(node_id: u32, filters: List[InlineFilter], metadata: MetadataStore) -> bool:
    """
    Check if node passes all inline filters.

    This avoids computing distance for nodes that fail metadata constraints.
    """
    for filter in filters:
        node_value = metadata.get(node_id, filter.field)
        if not evaluate_predicate(node_value, filter.operator, filter.value):
            return False  # Failed a filter

    return True  # Passed all filters


function execute_index_scan(step: IndexScan, indexes: MetadataIndexes) -> IntermediateResult:
    """
    Scan metadata index to get matching node IDs.
    """
    index_name = step.index_name
    predicate = step.predicate

    match predicate:
        case Attribute(field, operator, value):
            if operator == ComparisonOp.Eq:
                # Exact match: use inverted index
                posting_list = indexes.inverted[field].lookup(value)
                return IntermediateResult(
                    node_ids=posting_list.to_vec(),
                    symbolic_scores=[1.0] * len(posting_list)  # Binary: matches or not
                )

            elif operator in [ComparisonOp.Lt, ComparisonOp.Le, ComparisonOp.Gt, ComparisonOp.Ge]:
                # Range query: use B-tree index
                matching_nodes = indexes.btree[field].range_query(operator, value)
                return IntermediateResult(
                    node_ids=matching_nodes.to_vec(),
                    symbolic_scores=[1.0] * len(matching_nodes)
                )

            elif operator == ComparisonOp.In:
                # IN query: union of inverted index lookups
                all_nodes = RoaringBitmap()
                for v in value.list:
                    posting_list = indexes.inverted[field].lookup(v)
                    all_nodes |= posting_list  # Union

                return IntermediateResult(
                    node_ids=all_nodes.to_vec(),
                    symbolic_scores=[1.0] * len(all_nodes)
                )


function execute_graph_traversal(step: GraphTraversal, graph_index: GraphStructureIndex) -> IntermediateResult:
    """
    Execute graph structural constraint.
    """
    match step.constraint:
        case InCommunity(community_id):
            # Lookup precomputed community membership
            node_ids = graph_index.communities.get(community_id)
            return IntermediateResult(
                node_ids=node_ids.to_vec(),
                symbolic_scores=[1.0] * len(node_ids)
            )

        case WithinKHops(source_node, k):
            # Lookup precomputed k-hop neighborhood
            key = (source_node, k)
            if key in graph_index.khop_cache:
                node_ids = graph_index.khop_cache[key]
            else:
                # Compute on-the-fly via BFS
                node_ids = compute_khop_neighbors(source_node, k, graph_index.graph)

            return IntermediateResult(
                node_ids=node_ids.to_vec(),
                # Score decays with hop distance; a real implementation would use
                # each node's actual hop distance rather than its list position
                symbolic_scores=[1.0 / (1 + d) for d in range(len(node_ids))]
            )

        case OnPath(source, target):
            # Check path cache
            path_nodes = graph_index.path_cache.get_path(source, target)
            return IntermediateResult(
                node_ids=path_nodes,
                symbolic_scores=[1.0] * len(path_nodes)
            )
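The on-the-fly fallback for a k-hop cache miss is a bounded breadth-first search. A minimal Python sketch (the adjacency list and graph shape are illustrative):

```python
from collections import deque

def khop_neighbors(adj: dict, source: int, k: int) -> dict:
    """BFS out to k hops; returns {node: hop_distance}, excluding the source."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        if dist[node] == k:
            continue  # do not expand beyond the hop budget
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                frontier.append(nb)
    dist.pop(source)
    return dist

adj = {1: [2, 3], 2: [4], 3: [4], 4: [5]}
khop_neighbors(adj, 1, 2)  # -> {2: 1, 3: 1, 4: 2}
```

Returning hop distances (rather than a bare ID set) is what lets the symbolic score decay with distance, as in the WithinKHops case above.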


function intersect_results(left: IntermediateResult, right: IntermediateResult) -> IntermediateResult:
    """
    Set intersection (AND): keep nodes in both sets.

    Use Roaring Bitmap for efficient intersection.
    """
    left_bitmap = RoaringBitmap.from_sorted(left.node_ids)
    right_bitmap = RoaringBitmap.from_sorted(right.node_ids)

    intersection = left_bitmap & right_bitmap  # Bitmap AND

    # Combine scores (average for simplicity)
    node_ids = intersection.to_vec()
    combined_scores = []
    for node_id in node_ids:
        left_score = left.get_score(node_id)
        right_score = right.get_score(node_id)
        combined_scores.append((left_score + right_score) / 2.0)

    return IntermediateResult(
        node_ids=node_ids,
        scores=combined_scores
    )
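The score-merging half of intersect_results can be sketched over plain dicts ({node_id: score}); averaging the two sides is the simple choice used here, and the IDs and scores are illustrative:

```python
def intersect_scored(left: dict, right: dict) -> dict:
    """Keep IDs present on both sides; average their scores."""
    common = left.keys() & right.keys()
    return {n: (left[n] + right[n]) / 2.0 for n in common}

neural   = {42: 0.95, 137: 0.92, 7: 0.88}
symbolic = {42: 1.0, 137: 0.5}
merged = intersect_scored(neural, symbolic)
# 42 -> 0.975, 137 -> 0.71; node 7 is dropped (absent on the symbolic side)
```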


function apply_hybrid_scoring(result_set, config: HybridScoringConfig) -> IntermediateResult:
    """
    Combine neural and symbolic scores.

    Formula: hybrid_score = α * normalize(neural) + β * normalize(symbolic)
    """
    neural_scores = result_set.neural_scores
    symbolic_scores = result_set.symbolic_scores

    # Normalize scores to [0, 1]
    if config.normalization == NormalizationMethod.MinMax:
        neural_norm = min_max_normalize(neural_scores)
        symbolic_norm = min_max_normalize(symbolic_scores)
    elif config.normalization == NormalizationMethod.ZScore:
        neural_norm = z_score_normalize(neural_scores)
        symbolic_norm = z_score_normalize(symbolic_scores)
    else:
        neural_norm = neural_scores
        symbolic_norm = symbolic_scores

    # Combine with weights
    alpha = config.neural_weight
    beta = config.symbolic_weight
    hybrid_scores = [
        alpha * n + beta * s
        for n, s in zip(neural_norm, symbolic_norm)
    ]

    result_set.hybrid_scores = hybrid_scores
    return result_set

2. Query Optimization

function optimize_query(query: Query, optimizer: QueryOptimizer) -> Query:
    """
    Optimize query execution plan.

    Key optimizations:
    1. Predicate pushdown (filters into HNSW search)
    2. Index selection (choose best index for each predicate)
    3. Join reordering (cheapest predicates first)
    4. Early termination (stop when enough candidates found)
    """
    # Extract predicates from WHERE clause
    predicates = extract_predicates(query.where_clause)

    # Classify predicates
    neural_preds = [p for p in predicates if is_neural_predicate(p)]
    symbolic_preds = [p for p in predicates if is_symbolic_predicate(p)]
    graph_preds = [p for p in predicates if is_graph_predicate(p)]

    # Estimate selectivity for each predicate
    selectivities = {}
    for pred in predicates:
        selectivities[pred] = estimate_selectivity(pred, optimizer.stats)

    # Predicate pushdown: which filters can be applied during HNSW search?
    inline_filters = []
    post_filters = []

    for pred in symbolic_preds:
        if can_pushdown(pred):
            inline_filters.append(pred)
        else:
            post_filters.append(pred)

    # Index selection: choose best index for each symbolic predicate
    index_plan = {}
    for pred in symbolic_preds:
        best_index = choose_best_index(pred, optimizer.indexes, selectivities[pred])
        index_plan[pred] = best_index

    # Reorder predicates: most selective first
    ordered_predicates = sorted(predicates, key=lambda p: selectivities[p])

    # Build optimized execution plan
    optimized_query = rewrite_query(
        query,
        inline_filters=inline_filters,
        post_filters=post_filters,
        index_plan=index_plan,
        predicate_order=ordered_predicates
    )

    return optimized_query
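The "most selective first" reordering reduces to a sort on estimated selectivity. A toy Python version; the selectivity numbers are invented statistics, not measurements:

```python
def order_by_selectivity(preds: dict) -> list:
    """preds maps predicate label -> estimated selectivity (fraction matching).
    Evaluating the most selective predicate first shrinks the candidate set
    as early as possible."""
    return sorted(preds, key=preds.get)

preds = {
    "category = 'research'": 1 / 50,   # 1 / cardinality(category)
    "year >= 2023": 0.30,              # from a range histogram
    "similarity > 0.8": 0.01,          # from the similarity distribution
}
order_by_selectivity(preds)
# -> ["similarity > 0.8", "category = 'research'", "year >= 2023"]
```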


function estimate_selectivity(predicate, stats) -> float:
    """
    Estimate fraction of nodes matching predicate.

    Uses index statistics (histograms, cardinality, etc.)
    """
    match predicate:
        case VectorSimilarity(threshold):
            # Estimate based on similarity distribution
            return estimate_similarity_selectivity(threshold, stats.similarity_histogram)

        case Attribute(field, operator, value):
            # Estimate based on attribute distribution
            if operator == ComparisonOp.Eq:
                return 1.0 / stats.cardinality[field]  # Uniform assumption
            elif operator in [Lt, Le, Gt, Ge]:
                return estimate_range_selectivity(field, operator, value, stats)
            elif operator == In:
                return len(value.list) / stats.cardinality[field]

        case Graph(constraint):
            # Estimate based on graph structure
            match constraint:
                case InCommunity(id):
                    return stats.community_sizes[id] / stats.total_nodes
                case WithinKHops(node, k):
                    return estimate_khop_size(node, k, stats) / stats.total_nodes


function can_pushdown(predicate) -> bool:
    """
    Check if predicate can be pushed into HNSW search.

    Only simple equality/range predicates on indexed fields can be pushed down.
    """
    match predicate:
        case Attribute(field, operator, value):
            # Can pushdown if operator is simple and field is indexed
            return operator in [Eq, Lt, Le, Gt, Ge, In] and is_indexed(field)

        case _:
            return False  # Complex predicates handled post-search
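
Putting the pushdown split and the selectivity ordering together, a small Python sketch (the dict-based predicate representation and the indexed-field list are illustrative only):

```python
INDEXED_FIELDS = {"category", "year"}
PUSHDOWN_OPS = {"eq", "lt", "le", "gt", "ge", "in"}

def can_pushdown(pred):
    # Only simple comparisons on indexed attribute fields go inline.
    return (pred["kind"] == "attribute"
            and pred["op"] in PUSHDOWN_OPS
            and pred["field"] in INDEXED_FIELDS)

def plan(predicates, selectivities):
    inline = [p for p in predicates if can_pushdown(p)]
    post = [p for p in predicates if not can_pushdown(p)]
    # Most selective predicate (smallest matching fraction) first.
    ordered = [p for _, p in sorted(zip(selectivities, predicates),
                                    key=lambda pair: pair[0])]
    return inline, post, ordered

preds = [
    {"kind": "vector_similarity", "op": "gt", "field": None, "value": 0.8},
    {"kind": "attribute", "op": "eq", "field": "category", "value": "research"},
]
inline, post, ordered = plan(preds, [0.5, 0.01])
# The category predicate is pushed down and evaluated first;
# the similarity predicate is handled by the neural search itself.
```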

API Design (Function Signatures)

// File: crates/ruvector-query/src/neuro_symbolic/mod.rs

impl NeuroSymbolicEngine {
    /// Create a new neuro-symbolic query engine
    pub fn new(
        hnsw_index: Arc<HnswIndex>,
        metadata_path: impl AsRef<Path>,
    ) -> Result<Self, QueryError>;

    /// Execute a query (SQL or Cypher syntax)
    pub fn execute_query(
        &self,
        query: &str,
    ) -> Result<QueryResult, QueryError>;

    /// Execute a parsed query (AST)
    pub fn execute_parsed_query(
        &self,
        query: Query,
    ) -> Result<QueryResult, QueryError>;

    /// Add metadata index for a field
    pub fn create_index(
        &mut self,
        field: &str,
        index_type: IndexType,
    ) -> Result<(), QueryError>;

    /// Update hybrid scoring configuration
    pub fn set_scoring_config(&mut self, config: HybridScoringConfig);

    /// Get query execution statistics
    pub fn stats(&self) -> QueryEngineStats;
}

#[derive(Debug, Clone, Copy)]
pub enum IndexType {
    Inverted,  // Categorical fields
    BTree,     // Range queries
    Bitmap,    // Set operations
}

impl Query {
    /// Parse SQL query string into AST
    pub fn parse_sql(query: &str) -> Result<Self, ParseError>;

    /// Parse Cypher query string into AST
    pub fn parse_cypher(query: &str) -> Result<Self, ParseError>;

    /// Validate query syntax and semantics
    pub fn validate(&self) -> Result<(), ValidationError>;
}

impl Predicate {
    /// Evaluate predicate on a node
    pub fn evaluate(
        &self,
        node_id: u32,
        vector_store: &VectorStore,
        metadata_store: &MetadataStore,
    ) -> bool;

    /// Extract referenced fields
    pub fn referenced_fields(&self) -> Vec<String>;

    /// Check if predicate is neural (vector similarity)
    pub fn is_neural(&self) -> bool;

    /// Check if predicate is symbolic (metadata)
    pub fn is_symbolic(&self) -> bool;

    /// Check if predicate is graph-structural
    pub fn is_graph_structural(&self) -> bool;
}

impl MetadataIndexes {
    /// Create indexes from metadata file
    pub fn from_metadata(path: impl AsRef<Path>) -> Result<Self, IndexError>;

    /// Add inverted index for field
    pub fn add_inverted_index(
        &mut self,
        field: &str,
        values: HashMap<String, Vec<u32>>,
    ) -> Result<(), IndexError>;

    /// Add B-tree index for field
    pub fn add_btree_index(
        &mut self,
        field: &str,
        values: Vec<(OrderedValue, u32)>,
    ) -> Result<(), IndexError>;

    /// Query inverted index
    pub fn query_inverted(
        &self,
        field: &str,
        value: &str,
    ) -> Option<&RoaringBitmap>;

    /// Query B-tree index (range)
    pub fn query_btree_range(
        &self,
        field: &str,
        operator: ComparisonOp,
        value: OrderedValue,
    ) -> Option<RoaringBitmap>;

    /// Intersect bitmaps (AND operation)
    pub fn intersect(&self, bitmaps: &[RoaringBitmap]) -> RoaringBitmap;

    /// Union bitmaps (OR operation)
    pub fn union(&self, bitmaps: &[RoaringBitmap]) -> RoaringBitmap;

    /// Difference bitmaps (NOT operation)
    pub fn difference(&self, left: &RoaringBitmap, right: &RoaringBitmap) -> RoaringBitmap;
}
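
The bitmap algebra behind `intersect`, `union`, and `difference` maps one-to-one onto set operations. A stdlib sketch with Python sets standing in for `RoaringBitmap` (the posting lists are made-up data):

```python
# Inverted index: field value -> posting set of matching node ids.
category = {"research": {1, 2, 3, 5}, "spam": {4, 6}}
tag = {"X": {2, 3, 4}}

# category = 'research' AND tag = 'X'
and_ids = category["research"] & tag["X"]          # {2, 3}
# category = 'research' OR category = 'spam'
or_ids = category["research"] | category["spam"]   # {1, 2, 3, 4, 5, 6}
# category = 'research' AND NOT tag = 'X'
not_ids = category["research"] - tag["X"]          # {1, 5}
```

Roaring bitmaps provide the same operations with compressed storage, which is what keeps the index memory overhead bounded on large posting lists.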

#[derive(Debug, Default)]
pub struct QueryEngineStats {
    pub total_queries: u64,
    pub avg_query_time_ms: f64,
    pub cache_hit_rate: f64,
    pub avg_candidates_evaluated: f64,
}

Integration Points

Affected Crates/Modules

  1. ruvector-query (New Crate)

    • New module: src/neuro_symbolic/mod.rs - Core engine
    • New module: src/neuro_symbolic/parser.rs - SQL/Cypher parser
    • New module: src/neuro_symbolic/optimizer.rs - Query optimizer
    • New module: src/neuro_symbolic/planner.rs - Execution planner
    • New module: src/neuro_symbolic/indexes.rs - Metadata indexing
  2. ruvector-core (Integration)

    • Modified: src/index/hnsw.rs - Add filter callback support
    • Modified: src/vector_store.rs - Expose metadata API
  3. ruvector-api (Exposure)

    • Modified: src/query.rs - Add neuro-symbolic query endpoint
    • New: src/query/sql.rs - SQL query interface
    • New: src/query/cypher.rs - Cypher query interface
  4. ruvector-bindings (Language Bindings)

    • Modified: python/src/lib.rs - Expose query API
    • Modified: nodejs/src/lib.rs - Expose query API

New Modules to Create

crates/ruvector-query/   # New crate
├── src/
│   ├── neuro_symbolic/
│   │   ├── mod.rs              # Core engine
│   │   ├── parser.rs           # Query parsing
│   │   ├── optimizer.rs        # Query optimization
│   │   ├── planner.rs          # Execution planning
│   │   ├── executor.rs         # Query execution
│   │   ├── indexes.rs          # Metadata indexing
│   │   ├── scoring.rs          # Hybrid scoring
│   │   └── stats.rs            # Statistics collection
│   └── lib.rs

examples/
├── neuro_symbolic_queries/
│   ├── sql_examples.rs         # SQL query examples
│   ├── cypher_examples.rs      # Cypher query examples
│   ├── hybrid_scoring.rs       # Hybrid scoring examples
│   └── README.md

Dependencies on Other Features

Depends On:

  • HNSW Index: Core vector search functionality
  • Existing Cypher Support: Extend existing graph query support

Synergies With:

  • GNN-Guided Routing (Feature 1): Can use GNN for smarter query execution
  • Incremental Learning (Feature 2): Real-time index updates support streaming queries

External Dependencies:

  • sqlparser - SQL parsing
  • cypher-parser - Cypher parsing (if not already present)
  • roaring - Roaring Bitmap for efficient set operations
  • serde - Query serialization

Regression Prevention

What Existing Functionality Could Break

  1. Pure Vector Search Performance

    • Risk: Adding metadata lookups slows down simple vector queries
    • Impact: Regression in baseline HNSW performance
  2. Memory Usage

    • Risk: Metadata indexes consume excessive RAM
    • Impact: OOM on large datasets
  3. Query Correctness

    • Risk: Filter pushdown logic has bugs, returns wrong results
    • Impact: Incorrect search results
  4. Cypher Compatibility

    • Risk: Extending Cypher syntax breaks existing queries
    • Impact: Breaking change for existing users

Test Cases to Prevent Regressions

// File: crates/ruvector-query/tests/neuro_symbolic_regression_tests.rs

#[test]
fn test_pure_vector_search_unchanged() {
    // Simple vector queries should have zero overhead
    let engine = setup_test_engine();

    // Baseline: pure HNSW search (no filters)
    let query_baseline = "SELECT * FROM vectors ORDER BY similarity(embedding, $query) DESC LIMIT 10";

    let start = Instant::now();
    let results = engine.execute_query(query_baseline).unwrap();
    let time_with_engine = start.elapsed();

    // Direct HNSW (without query engine)
    let start = Instant::now();
    let results_direct = engine.hnsw_index.search(&query_vector, 10).unwrap();
    let time_direct = start.elapsed();

    // Query engine should add <5% overhead
    let overhead = (time_with_engine.as_secs_f64() / time_direct.as_secs_f64()) - 1.0;
    assert!(overhead < 0.05, "Overhead: {:.2}%, expected <5%", overhead * 100.0);

    // Results should be identical
    assert_eq!(results.node_ids, results_direct.node_ids);
}

#[test]
fn test_filter_correctness() {
    // Filtered queries must return correct subset
    let engine = setup_test_engine_with_metadata();

    let query = "SELECT * FROM vectors
                 WHERE similarity(embedding, $query) > 0.8
                 AND category = 'research'
                 AND year >= 2023
                 LIMIT 10";

    let results = engine.execute_query(query).unwrap();

    // Verify each result matches ALL predicates
    for node_id in &results.node_ids {
        let similarity = compute_similarity(&query_vector, engine.get_vector(*node_id));
        assert!(similarity > 0.8, "Node {} similarity: {}, expected >0.8", node_id, similarity);

        let category = engine.get_metadata(*node_id, "category");
        assert_eq!(category, "research", "Node {} category: {}, expected 'research'", node_id, category);

        let year = engine.get_metadata(*node_id, "year").parse::<i32>().unwrap();
        assert!(year >= 2023, "Node {} year: {}, expected >=2023", node_id, year);
    }
}

#[test]
fn test_filter_pushdown_performance() {
    // Pushdown filters should be much faster than post-filtering
    let engine = setup_test_engine_with_metadata();

    // With pushdown (optimized)
    let query_pushdown = "SELECT * FROM vectors
                          WHERE similarity(embedding, $query) > 0.8
                          AND category = 'research'
                          LIMIT 10";

    let start = Instant::now();
    let results_pushdown = engine.execute_query(query_pushdown).unwrap();
    let time_pushdown = start.elapsed();

    // Without pushdown (post-filter, manual implementation)
    let all_results = engine.hnsw_index.search(&query_vector, 10000).unwrap();
    let start = Instant::now();
    let results_post: Vec<_> = all_results.into_iter()
        .filter(|r| r.similarity > 0.8)
        .filter(|r| engine.get_metadata(r.node_id, "category") == "research")
        .take(10)
        .collect();
    let time_post = start.elapsed();

    // Pushdown should be ≥5x faster
    let speedup = time_post.as_secs_f64() / time_pushdown.as_secs_f64();
    assert!(speedup >= 5.0, "Speedup: {:.1}x, expected ≥5x", speedup);

    // Result counts should match (per-result predicates are verified in test_filter_correctness)
    assert_eq!(results_pushdown.node_ids.len(), results_post.len());
}

#[test]
fn test_hybrid_scoring_correctness() {
    // Hybrid scores should combine neural and symbolic correctly
    let mut engine = setup_test_engine();
    engine.set_scoring_config(HybridScoringConfig {
        neural_weight: 0.7,
        symbolic_weight: 0.3,
        normalization: NormalizationMethod::MinMax,
    });

    let query = "SELECT * FROM vectors
                 WHERE similarity(embedding, $query) > 0.5
                 AND year >= 2020
                 ORDER BY hybrid_score DESC
                 LIMIT 10";

    let results = engine.execute_query(query).unwrap();

    // Verify hybrid score formula
    for i in 0..results.node_ids.len() {
        let neural = results.neural_scores[i];
        let symbolic = results.symbolic_scores.as_ref().unwrap()[i];

        // Normalize (min-max)
        let neural_norm = (neural - 0.5) / (1.0 - 0.5);  // Assuming min=0.5, max=1.0
        let symbolic_norm = (symbolic - 0.0) / (1.0 - 0.0);  // Assuming min=0.0, max=1.0

        let expected_hybrid = 0.7 * neural_norm + 0.3 * symbolic_norm;
        let actual_hybrid = results.hybrid_scores[i];

        assert!((expected_hybrid - actual_hybrid).abs() < 1e-5,
            "Hybrid score mismatch: expected {}, got {}", expected_hybrid, actual_hybrid);
    }
}

#[test]
fn test_boolean_logic_correctness() {
    // AND/OR/NOT operations must be correct
    let engine = setup_test_engine();

    // Test AND
    let query_and = "SELECT * FROM vectors
                     WHERE category = 'A' AND tag = 'X'";
    let results_and = engine.execute_query(query_and).unwrap();

    for node_id in &results_and.node_ids {
        assert_eq!(engine.get_metadata(*node_id, "category"), "A");
        assert_eq!(engine.get_metadata(*node_id, "tag"), "X");
    }

    // Test OR
    let query_or = "SELECT * FROM vectors
                    WHERE category = 'A' OR category = 'B'";
    let results_or = engine.execute_query(query_or).unwrap();

    for node_id in &results_or.node_ids {
        let category = engine.get_metadata(*node_id, "category");
        assert!(category == "A" || category == "B");
    }

    // Test NOT
    let query_not = "SELECT * FROM vectors
                     WHERE category = 'A' AND NOT tag = 'X'";
    let results_not = engine.execute_query(query_not).unwrap();

    for node_id in &results_not.node_ids {
        assert_eq!(engine.get_metadata(*node_id, "category"), "A");
        assert_ne!(engine.get_metadata(*node_id, "tag"), "X");
    }
}

Backward Compatibility Strategy

  1. Opt-In Feature

    • Neuro-symbolic queries are opt-in (require explicit SQL/Cypher syntax)
    • Existing vector search API unchanged
  2. Graceful Degradation

    • If metadata indexes are not available, fall back to post-filtering
    • Log a warning but do not crash
  3. Configuration

    query:
      neuro_symbolic:
        enabled: true  # Default: true
        metadata_indexes: true  # Default: true
        hybrid_scoring: true  # Default: true
    
  4. API Versioning

    • New endpoints for neuro-symbolic queries (/query/sql, /query/cypher)
    • Existing endpoints (/search) unchanged

Implementation Phases

Phase 1: Core Infrastructure (Week 1-2)

Goal: Query parsing and basic execution

Tasks:

  1. Implement SQL/Cypher parser
  2. Build AST representation
  3. Implement basic query executor (no optimization)
  4. Unit tests for parsing and execution

Deliverables:

  • neuro_symbolic/parser.rs
  • neuro_symbolic/executor.rs
  • Passing unit tests

Success Criteria:

  • Can parse and execute simple queries (vector similarity only)
  • Correct results (matches HNSW baseline)

Phase 2: Metadata Indexing (Week 2-3)

Goal: Support symbolic predicates

Tasks:

  1. Implement inverted index for categorical fields
  2. Implement B-tree index for range queries
  3. Integrate Roaring Bitmap for set operations
  4. Test index correctness and performance

Deliverables:

  • neuro_symbolic/indexes.rs
  • Index creation and query APIs
  • Benchmark report

Success Criteria:

  • Indexes correctly return matching nodes
  • Index queries <10ms for typical workloads
  • Memory overhead <20% of vector data
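
A minimal sketch of the two index types from this phase, using only the Python stdlib: a dict of sets as the inverted index, and a `bisect`-maintained sorted list standing in for the B-tree (names and data are illustrative):

```python
import bisect

inverted = {}   # categorical field value -> set of doc ids
years = []      # sorted (year, doc_id) pairs: the "B-tree"

def add_doc(doc_id, category, year):
    inverted.setdefault(category, set()).add(doc_id)
    bisect.insort(years, (year, doc_id))

for doc_id, (cat, year) in enumerate([("research", 2021), ("spam", 2019),
                                      ("research", 2024)]):
    add_doc(doc_id, cat, year)

# category = 'research'
cat_hits = inverted["research"]                      # {0, 2}
# year >= 2023: binary-search the cut point, take the suffix.
cut = bisect.bisect_left(years, (2023, -1))
range_hits = {doc_id for _, doc_id in years[cut:]}   # {2}
# AND of both predicates via set intersection:
hits = cat_hits & range_hits                         # {2}
```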

Phase 3: Filter Pushdown (Week 3-4)

Goal: Optimize query execution

Tasks:

  1. Implement filter pushdown into HNSW search
  2. Modify HNSW to support filter callbacks
  3. Benchmark speedup vs post-filtering
  4. Test correctness of pushdown logic

Deliverables:

  • Modified hnsw.rs with filter support
  • neuro_symbolic/optimizer.rs
  • Performance benchmarks

Success Criteria:

  • ≥5x speedup for filtered queries
  • Zero correctness regressions
  • Works with complex boolean logic (AND/OR/NOT)
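
The filter-callback contract can be illustrated without a real HNSW graph: the search consumes candidates in ascending distance order and only keeps those the callback accepts, so it stops after k survivors instead of over-fetching and filtering afterward. A toy Python sketch (`search_filtered` and the metadata table are hypothetical, not the ruvector API):

```python
def search_filtered(candidates_by_distance, k, accept):
    """candidates_by_distance: node ids in ascending distance order,
    as a best-first HNSW traversal would yield them."""
    results = []
    for node_id in candidates_by_distance:
        if accept(node_id):              # filter applied DURING the scan
            results.append(node_id)
            if len(results) == k:
                break                    # stop early: no over-fetch
    return results

metadata = {0: "spam", 1: "research", 2: "research", 3: "spam", 4: "research"}
top = search_filtered([0, 1, 2, 3, 4], k=2,
                      accept=lambda n: metadata[n] == "research")
# top == [1, 2]: the two nearest nodes that pass the filter
```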

Phase 4: Hybrid Scoring (Week 4-5)

Goal: Combine neural and symbolic scores

Tasks:

  1. Implement hybrid scoring algorithm
  2. Add score normalization methods
  3. Tune weights (α, β) for best results
  4. Test on real-world datasets

Deliverables:

  • neuro_symbolic/scoring.rs
  • Hybrid scoring benchmarks
  • Configuration guide

Success Criteria:

  • Hybrid queries improve relevance metrics (NDCG, MRR)
  • Configurable weights work as expected
  • Performance <20ms for typical queries
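
The weighted combination this phase implements (hybrid = α·neural + β·symbolic after normalization) reduces to a few lines of Python; the weights, score lists, and choice of min-max normalization are this example's assumptions:

```python
def min_max(xs):
    """Normalize scores to [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def hybrid_scores(neural, symbolic, alpha=0.7, beta=0.3):
    n = min_max(neural)
    s = min_max(symbolic)
    return [alpha * ni + beta * si for ni, si in zip(n, s)]

scores = hybrid_scores([0.9, 0.6, 0.8], [1.0, 0.5, 0.0])
best = max(range(len(scores)), key=lambda i: scores[i])
# Index 0 wins: it has both the highest neural and highest symbolic score.
```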

Phase 5: Production Hardening (Week 5-6)

Goal: Production-ready feature

Tasks:

  1. Add comprehensive error handling
  2. Write documentation and examples
  3. Stress testing (large datasets, complex queries)
  4. Integration with existing Cypher support

Deliverables:

  • Full error handling
  • User documentation
  • Example queries
  • Regression test suite

Success Criteria:

  • Zero crashes in stress tests
  • Documentation complete
  • Ready for alpha release

Success Metrics

Performance Benchmarks

Primary Metrics (Must Achieve):

| Query Type | Baseline (Post-Filter) | Neuro-Symbolic Target | Improvement |
|---|---|---|---|
| Similarity + 1 filter | 50ms | 5ms | 10x faster |
| Similarity + 3 filters | 200ms | 8ms | 25x faster |
| Complex boolean (AND/OR/NOT) | N/A (manual) | 15ms | New capability |
| Multi-modal (vector + graph) | 500ms (manual joins) | 20ms | 25x faster |

Secondary Metrics:

| Metric | Target |
|---|---|
| Index memory overhead | <20% of vector data |
| Query parsing time | <1ms |
| Hybrid scoring overhead | <2ms |
| Concurrent query throughput | Same as baseline |

Accuracy Metrics

Relevance Improvement (on benchmark datasets):

  • NDCG@10: +15% (hybrid scoring vs pure vector)
  • MRR (Mean Reciprocal Rank): +20%
  • Precision@10: +10%

Correctness:

  • 100% of filtered results match all predicates
  • Zero false positives or false negatives

Memory/Latency Targets

Memory:

  • Inverted indexes: <100MB per 1M nodes (categorical fields)
  • B-tree indexes: <50MB per 1M nodes (range fields)
  • Total overhead: <20% of vector index size

Latency:

  • Simple query (1 filter): <10ms
  • Complex query (3+ filters): <20ms
  • Hybrid scoring: <5ms overhead
  • P99 latency: <50ms

Throughput:

  • Concurrent queries: Same as baseline HNSW
  • No lock contention on indexes

Risks and Mitigations

Technical Risks

Risk 1: Query Parser Complexity

Probability: Medium | Impact: Medium

Description: SQL/Cypher parsing is complex, could have bugs or performance issues.

Mitigation:

  • Use established parsing libraries (sqlparser, cypher-parser)
  • Extensive test suite with edge cases
  • Validate AST before execution
  • Provide query validation tool

Contingency: Start with simple query subset, expand incrementally.


Risk 2: Index Memory Overhead

Probability: High | Impact: Medium

Description: Metadata indexes could consume excessive memory on large datasets.

Mitigation:

  • Use compressed indexes (Roaring Bitmap for sparse sets)
  • Make indexing optional (user chooses which fields to index)
  • Monitor memory usage in tests
  • Provide index size estimation tool

Contingency: Support external indexes (e.g., SQLite) for low-memory environments.


Risk 3: Filter Pushdown Bugs

Probability: Medium | Impact: Critical

Description: Incorrect filter logic could return wrong results.

Mitigation:

  • Extensive correctness testing (ground truth validation)
  • Compare pushdown results vs post-filtering
  • Add assertion checks in debug builds
  • Fuzzing for edge cases

Contingency: Add "safe mode" that validates results against post-filtering.
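
The safe-mode contingency amounts to running both paths and asserting agreement before returning; a sketch with hypothetical search callables:

```python
def safe_search(pushdown_search, post_filter_search, query, k):
    """Run the optimized pushdown path, cross-check against naive post-filtering."""
    fast = pushdown_search(query, k)
    slow = post_filter_search(query, k)   # ground truth: search wide, filter after
    if set(fast) != set(slow):
        raise AssertionError(
            f"pushdown mismatch: {sorted(fast)} vs {sorted(slow)}")
    return fast
```

In production this check would be sampled (e.g. on a fraction of queries) rather than run on every request, since it doubles the query cost.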


Risk 4: Hybrid Scoring Tuning Difficulty

Probability: High | Impact: Low

Description: Users may struggle to tune α/β weights for hybrid scoring.

Mitigation:

  • Provide automatic weight tuning (based on query logs)
  • Document recommended defaults for common use cases
  • Add visualization tools for score distributions
  • Support A/B testing framework

Contingency: Default to pure neural scoring (α=1, β=0) if user unsure.


Risk 5: Cypher Integration Conflicts

Probability: Low | Impact: Medium

Description: Extending Cypher syntax could conflict with existing graph queries.

Mitigation:

  • Careful syntax design (use reserved keywords)
  • Version Cypher extensions separately
  • Extensive compatibility testing
  • Document syntax differences

Contingency: Use separate query language (e.g., extended SQL only) if conflicts arise.


Summary Risk Matrix

| Risk | Probability | Impact | Mitigation Priority |
|---|---|---|---|
| Query parser complexity | Medium | Medium | Medium |
| Index memory overhead | High | Medium | HIGH |
| Filter pushdown bugs | Medium | Critical | CRITICAL |
| Hybrid scoring tuning | High | Low | LOW |
| Cypher integration conflicts | Low | Medium | Medium |

Next Steps

  1. Prototype Phase 1: Build SQL parser and basic executor (1 week)
  2. Validate Queries: Test on simple queries, measure correctness (2 days)
  3. Add Metadata Indexes: Implement inverted + B-tree indexes (1 week)
  4. Benchmark Performance: Measure speedup vs post-filtering (3 days)
  5. Iterate: Optimize based on profiling (ongoing)

Key Decision Points:

  • After Phase 1: Is query parsing fast enough? (<1ms target)
  • After Phase 3: Does filter pushdown work correctly? (Zero regressions)
  • After Phase 4: Does hybrid scoring improve relevance? (+10% NDCG required)

Go/No-Go Criteria:

  • 5x+ speedup on filtered queries
  • Zero correctness regressions
  • Memory overhead <20%
  • Improved relevance metrics