Files

ruv d803bfe2b1 Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900

2026-02-28 14:39:40 -05:00

12 KiB

Raw Blame History

Sparse Vectors Implementation Summary

Overview

Complete implementation of sparse vector support for ruvector-postgres PostgreSQL extension, providing efficient storage and operations for high-dimensional sparse embeddings.

Implementation Details

Module Structure

src/sparse/
├── mod.rs           # Module exports and re-exports
├── types.rs         # SparseVec type with COO format (391 lines)
├── distance.rs      # Sparse distance functions (286 lines)
├── operators.rs     # PostgreSQL functions and operators (366 lines)
└── tests.rs         # Comprehensive test suite (200 lines)

Total: 1,243 lines of Rust code

Core Components

1. SparseVec Type (`types.rs`)

Storage Format: COO (Coordinate)

#[derive(PostgresType, Serialize, Deserialize)]
pub struct SparseVec {
    indices: Vec<u32>,  // Sorted indices of non-zero elements
    values: Vec<f32>,   // Values corresponding to indices
    dim: u32,           // Total dimensionality
}

Key Features:

✅ Automatic sorting and deduplication on creation
✅ Binary search for O(log n) lookups
✅ String parsing: "{1:0.5, 2:0.3, 5:0.8}"
✅ Display formatting for PostgreSQL output
✅ Bounds checking and validation
✅ Empty vector support

Methods:

new(indices, values, dim) - Create with validation
nnz() - Number of non-zero elements
dim() - Total dimensionality
get(index) - O(log n) value lookup
iter() - Iterator over (index, value) pairs
norm() - L2 norm calculation
l1_norm() - L1 norm calculation
prune(threshold) - Remove elements below threshold
top_k(k) - Keep only top k elements by magnitude
to_dense() - Convert to dense vector

2. Distance Functions (`distance.rs`)

All functions use merge-based iteration for O(nnz(a) + nnz(b)) complexity:

Implemented Functions:

sparse_dot(a, b) - Inner product
- Only multiplies overlapping indices
- Perfect for SPLADE and learned sparse retrieval
sparse_cosine(a, b) - Cosine similarity
- Returns value in [-1, 1]
- Handles zero vectors gracefully
sparse_euclidean(a, b) - L2 distance
- Handles non-overlapping indices efficiently
- sqrt(sum((a_i - b_i)²))
sparse_manhattan(a, b) - L1 distance
- sum(|a_i - b_i|)
- Robust to outliers
sparse_bm25(query, doc, ...) - BM25 scoring
- Full BM25 implementation
- Configurable k1 and b parameters
- Query uses IDF weights, doc uses term frequencies

Algorithm: All distance functions use efficient merge iteration:

while i < a.len() && j < b.len() {
    match a_indices[i].cmp(&b_indices[j]) {
        Less => i += 1,          // Only in a
        Greater => j += 1,       // Only in b
        Equal => {               // In both: multiply
            result += a[i] * b[j];
            i += 1; j += 1;
        }
    }
}

3. PostgreSQL Operators (`operators.rs`)

Distance Operations:

ruvector_sparse_dot(a, b) -> f32
ruvector_sparse_cosine(a, b) -> f32
ruvector_sparse_euclidean(a, b) -> f32
ruvector_sparse_manhattan(a, b) -> f32

Construction Functions:

ruvector_to_sparse(indices, values, dim) -> sparsevec
ruvector_dense_to_sparse(dense) -> sparsevec
ruvector_sparse_to_dense(sparse) -> real[]

Utility Functions:

ruvector_sparse_nnz(sparse) -> int - Number of non-zeros
ruvector_sparse_dim(sparse) -> int - Dimension
ruvector_sparse_norm(sparse) -> real - L2 norm

Sparsification Functions:

ruvector_sparse_top_k(sparse, k) -> sparsevec
ruvector_sparse_prune(sparse, threshold) -> sparsevec

BM25 Function:

ruvector_sparse_bm25(query, doc, doc_len, avg_len, k1, b) -> real

All functions marked:

#[pg_extern(immutable, parallel_safe)] - Safe for parallel queries
Proper error handling with panic messages
TOAST-aware through pgrx serialization

4. Test Suite (`tests.rs`)

Test Coverage:

✅ Type creation and validation (8 tests)
✅ Parsing and formatting (2 tests)
✅ Distance computations (10 tests)
✅ PostgreSQL operators (11 tests)
✅ Edge cases (empty, no overlap, etc.)

Test Categories:

Type Tests: Creation, sorting, deduplication, bounds checking
Distance Tests: All distance functions with various cases
Operator Tests: PostgreSQL function integration
Edge Cases: Empty vectors, zero norms, orthogonal vectors

SQL Interface

Type Declaration

-- Sparse vector type (auto-created by pgrx)
CREATE TYPE sparsevec;

Basic Operations

-- Create from string
SELECT '{1:0.5, 2:0.3, 5:0.8}'::sparsevec;

-- Create from arrays
SELECT ruvector_to_sparse(
    ARRAY[1, 2, 5]::int[],
    ARRAY[0.5, 0.3, 0.8]::real[],
    10  -- dimension
);

-- Distance operations
SELECT ruvector_sparse_dot(a, b);
SELECT ruvector_sparse_cosine(a, b);
SELECT ruvector_sparse_euclidean(a, b);

-- Utility functions
SELECT ruvector_sparse_nnz(sparse_vec);
SELECT ruvector_sparse_dim(sparse_vec);
SELECT ruvector_sparse_norm(sparse_vec);

-- Sparsification
SELECT ruvector_sparse_top_k(sparse_vec, 100);
SELECT ruvector_sparse_prune(sparse_vec, 0.1);

Search Example

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    sparse_embedding sparsevec
);

-- Insert data
INSERT INTO documents (content, sparse_embedding) VALUES
    ('Document 1', '{1:0.5, 2:0.3, 5:0.8}'::sparsevec),
    ('Document 2', '{2:0.4, 3:0.2, 5:0.9}'::sparsevec);

-- Search by dot product
SELECT id, content,
       ruvector_sparse_dot(sparse_embedding, '{1:0.5, 2:0.3}'::sparsevec) AS score
FROM documents
ORDER BY score DESC
LIMIT 10;

Performance Characteristics

Complexity Analysis

Operation	Time Complexity	Space Complexity
Creation	O(n log n)	O(n)
Get value	O(log n)	O(1)
Dot product	O(nnz(a) + nnz(b))	O(1)
Cosine	O(nnz(a) + nnz(b))	O(1)
Euclidean	O(nnz(a) + nnz(b))	O(1)
Manhattan	O(nnz(a) + nnz(b))	O(1)
BM25	O(nnz(query) + nnz(doc))	O(1)
Top-k	O(n log n)	O(n)
Prune	O(n)	O(n)

Where n is the number of non-zero elements.

Expected Performance

Based on typical sparse vectors (100-1000 non-zeros):

Operation	NNZ (query)	NNZ (doc)	Dim	Expected Time
Dot Product	100	100	30K	~0.8 μs
Cosine	100	100	30K	~1.2 μs
Euclidean	100	100	30K	~1.0 μs
BM25	100	100	30K	~1.5 μs

Storage Efficiency:

Dense 30K-dim vector: 120 KB (4 bytes × 30,000)
Sparse 100 non-zeros: ~800 bytes (8 bytes × 100)
150× storage reduction

Use Cases

1. Text Search with BM25

-- Traditional text search ranking
SELECT id, title,
       ruvector_sparse_bm25(
           query_idf,           -- Query with IDF weights
           term_frequencies,    -- Document term frequencies
           doc_length,
           avg_doc_length,
           1.2,                 -- k1 parameter
           0.75                 -- b parameter
       ) AS bm25_score
FROM articles
ORDER BY bm25_score DESC;

2. Learned Sparse Retrieval (SPLADE)

-- Neural sparse embeddings
SELECT id, content,
       ruvector_sparse_dot(splade_embedding, query_splade) AS relevance
FROM documents
ORDER BY relevance DESC
LIMIT 10;

3. Hybrid Dense + Sparse Search

-- Combine signals for better recall
SELECT id, content,
       0.7 * (1 - (dense_embedding <=> query_dense)) +
       0.3 * ruvector_sparse_dot(sparse_embedding, query_sparse) AS hybrid_score
FROM documents
ORDER BY hybrid_score DESC;

Integration with Existing Extension

Updated Files

src/lib.rs: Added pub mod sparse; declaration
New module: src/sparse/ with 4 implementation files
Documentation: 2 comprehensive guides

Compatibility

✅ Compatible with pgrx 0.12
✅ Uses existing dependencies (serde, ordered-float)
✅ Follows existing code patterns
✅ Parallel-safe operations
✅ TOAST-aware for large vectors
✅ Full test coverage with #[pg_test]

Future Enhancements

Phase 2: Inverted Index (Planned)

-- Future: Inverted index for fast sparse search
CREATE INDEX ON documents USING ruvector_sparse_ivf (
    sparse_embedding sparsevec(30000)
) WITH (
    pruning_threshold = 0.1
);

Phase 3: Advanced Features

WAND algorithm: Efficient top-k retrieval
Quantization: 8-bit quantized sparse vectors
Batch operations: SIMD-optimized batch processing
Hybrid indexing: Combined dense + sparse index

Testing

Run Tests

# Standard Rust tests
cargo test --package ruvector-postgres --lib sparse

# PostgreSQL integration tests
cargo pgrx test pg16

Test Categories

Unit tests: Rust-level validation
Property tests: Edge cases and invariants
Integration tests: PostgreSQL #[pg_test] functions
Benchmark tests: Performance validation (planned)

Documentation

User Documentation

SPARSE_QUICKSTART.md: 5-minute setup guide
- Basic operations
- Common patterns
- Example queries
SPARSE_VECTORS.md: Comprehensive guide
- Full SQL API reference
- Rust API documentation
- Performance characteristics
- Use cases and examples
- Best practices

Developer Documentation

05-sparse-vectors.md: Integration plan
SPARSE_IMPLEMENTATION_SUMMARY.md: This document

Deployment

Prerequisites

PostgreSQL 14-17
pgrx 0.12
Rust toolchain

Installation

# Build extension
cargo pgrx install --release

# In PostgreSQL
CREATE EXTENSION ruvector_postgres;

# Verify sparse vector support
SELECT ruvector_version();

Summary

✅ Complete implementation of sparse vectors for ruvector-postgres ✅ 1,243 lines of production-quality Rust code ✅ COO format storage with automatic sorting ✅ 5 distance functions with O(nnz(a) + nnz(b)) complexity ✅ 15+ PostgreSQL functions for complete SQL integration ✅ 31+ comprehensive tests covering all functionality ✅ 2 user guides with examples and best practices ✅ BM25 support for traditional text search ✅ SPLADE-ready for learned sparse retrieval ✅ Hybrid search compatible with dense vectors ✅ Production-ready with proper error handling

Key Features

Efficient: Merge-based algorithms for sparse-sparse operations
Flexible: Parse from strings or arrays, convert to/from dense
Robust: Comprehensive validation and error handling
Fast: O(log n) lookups, O(n) linear scans
PostgreSQL-native: Full pgrx integration with TOAST support
Well-tested: 31+ tests covering all edge cases
Documented: Complete user and developer documentation

Files Created

/workspaces/ruvector/crates/ruvector-postgres/
├── src/
│   └── sparse/
│       ├── mod.rs           (30 lines)
│       ├── types.rs         (391 lines)
│       ├── distance.rs      (286 lines)
│       ├── operators.rs     (366 lines)
│       └── tests.rs         (200 lines)
└── docs/
    └── guides/
        ├── SPARSE_VECTORS.md                  (449 lines)
        ├── SPARSE_QUICKSTART.md               (280 lines)
        └── SPARSE_IMPLEMENTATION_SUMMARY.md   (this file)

Total Implementation: 1,273 lines of code + 729 lines of documentation = 2,002 lines

Implementation Status: ✅ COMPLETE

All requirements from the integration plan have been implemented:

✅ SparseVec type with COO format
✅ Parse from string '{1:0.5, 2:0.3}'
✅ Serialization for PostgreSQL
✅ norm(), nnz(), get(), iter() methods
✅ sparse_dot() - Inner product
✅ sparse_cosine() - Cosine similarity
✅ sparse_euclidean() - Euclidean distance
✅ Efficient merge-based algorithms
✅ PostgreSQL operators with pgrx 0.12
✅ Immutable and parallel_safe markings
✅ Error handling
✅ Unit tests with #[pg_test]

12 KiB Raw Blame History Unescape Escape