Files

ruv cd5943df23 Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00

7.8 KiB

Raw Blame History

Zero-Copy Distance Operators for RuVector PostgreSQL Extension

Overview

This document describes the new zero-copy distance functions and SQL operators for the RuVector PostgreSQL extension. These functions provide significant performance improvements over the legacy array-based functions by:

Zero-copy access: Operating directly on RuVector types without memory allocation
SIMD optimization: Automatic dispatch to AVX-512, AVX2, or ARM NEON instructions
Native integration: Seamless PostgreSQL operator support for similarity search

Performance Benefits

No memory allocation: Direct slice access to vector data
SIMD acceleration: Up to 16 floats processed per instruction (AVX-512)
Index-friendly: Operators integrate with PostgreSQL index scans
Cache-efficient: Better CPU cache utilization with zero-copy access

SQL Functions

L2 (Euclidean) Distance

-- Function form
SELECT ruvector_l2_distance(embedding, '[1,2,3]'::ruvector) FROM items;

-- Operator form (recommended)
SELECT * FROM items ORDER BY embedding <-> '[1,2,3]'::ruvector LIMIT 10;

Description: Computes L2 (Euclidean) distance between two vectors:

distance = sqrt(sum((a[i] - b[i])^2))

Use case: General-purpose similarity search, geometric nearest neighbors

Inner Product Distance

-- Function form
SELECT ruvector_ip_distance(embedding, '[1,2,3]'::ruvector) FROM items;

-- Operator form (recommended)
SELECT * FROM items ORDER BY embedding <#> '[1,2,3]'::ruvector LIMIT 10;

Description: Computes negative inner product (for ORDER BY ASC):

distance = -(sum(a[i] * b[i]))

Use case: Maximum Inner Product Search (MIPS), recommendation systems

Cosine Distance

-- Function form
SELECT ruvector_cosine_distance(embedding, '[1,2,3]'::ruvector) FROM items;

-- Operator form (recommended)
SELECT * FROM items ORDER BY embedding <=> '[1,2,3]'::ruvector LIMIT 10;

Description: Computes cosine distance (angular distance):

distance = 1 - (a·b)/(||a|| ||b||)

Use case: Text embeddings, semantic similarity, normalized vectors

L1 (Manhattan) Distance

-- Function form
SELECT ruvector_l1_distance(embedding, '[1,2,3]'::ruvector) FROM items;

-- Operator form (recommended)
SELECT * FROM items ORDER BY embedding <+> '[1,2,3]'::ruvector LIMIT 10;

Description: Computes L1 (Manhattan) distance:

distance = sum(|a[i] - b[i]|)

Use case: Sparse data, outlier-resistant search

SQL Operators Summary

Operator	Distance Type	Function	Use Case
`<->`	L2 (Euclidean)	`ruvector_l2_distance`	General similarity
`<#>`	Negative Inner Product	`ruvector_ip_distance`	MIPS, recommendations
`<=>`	Cosine	`ruvector_cosine_distance`	Semantic search
`<+>`	L1 (Manhattan)	`ruvector_l1_distance`	Sparse vectors

Examples

Basic Similarity Search

-- Create table with vector embeddings
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding ruvector(384)  -- 384-dimensional vector
);

-- Insert some embeddings
INSERT INTO documents (content, embedding) VALUES
    ('Hello world', '[0.1, 0.2, ...]'::ruvector),
    ('Goodbye world', '[0.3, 0.4, ...]'::ruvector);

-- Find top 10 most similar documents using L2 distance
SELECT id, content, embedding <-> '[0.15, 0.25, ...]'::ruvector AS distance
FROM documents
ORDER BY embedding <-> '[0.15, 0.25, ...]'::ruvector
LIMIT 10;

Hybrid Search with Filters

-- Search with metadata filtering
SELECT id, title, embedding <=> $1 AS similarity
FROM articles
WHERE published_date > '2024-01-01'
  AND category = 'technology'
ORDER BY embedding <=> $1
LIMIT 20;

Comparison Query

-- Compare distances using different metrics
SELECT
    id,
    embedding <-> $1 AS l2_distance,
    embedding <#> $1 AS ip_distance,
    embedding <=> $1 AS cosine_distance,
    embedding <+> $1 AS l1_distance
FROM vectors
WHERE id = 42;

Batch Distance Computation

-- Find items within a distance threshold
SELECT id, content
FROM items
WHERE embedding <-> '[1,2,3]'::ruvector < 0.5;

Index Support

These operators are designed to work with approximate nearest neighbor (ANN) indexes:

-- Create HNSW index for L2 distance
CREATE INDEX ON documents USING hnsw (embedding ruvector_l2_ops);

-- Create IVFFlat index for cosine distance
CREATE INDEX ON documents USING ivfflat (embedding ruvector_cosine_ops)
WITH (lists = 100);

Implementation Details

Zero-Copy Architecture

The zero-copy implementation works as follows:

RuVector reception: PostgreSQL passes the varlena datum directly
Slice extraction: as_slice() returns &[f32] without allocation
SIMD dispatch: Distance functions use optimal SIMD path
Result return: Single f32 value returned

SIMD Optimization Levels

The implementation automatically selects the best SIMD instruction set:

AVX-512: 16 floats per operation (Intel Xeon, Sapphire Rapids+)
AVX2: 8 floats per operation (Intel Haswell+, AMD Ryzen+)
ARM NEON: 4 floats per operation (ARM AArch64)
Scalar: Fallback for all platforms

Check your platform's SIMD support:

SELECT ruvector_simd_info();
-- Returns: "architecture: x86_64, active: avx2, features: [avx2, fma, sse4.2], floats_per_op: 8"

Memory Layout

RuVector varlena structure:

┌────────────┬──────────────┬─────────────────┐
│ Header (4) │ Dimensions(4)│ Data (4n bytes) │
└────────────┴──────────────┴─────────────────┘

Zero-copy access:

// No allocation - direct pointer access
let slice: &[f32] = vector.as_slice();
let distance = euclidean_distance(slice_a, slice_b);  // SIMD path

Migration from Array-Based Functions

Old (Legacy) Style - WITH COPYING

-- Array-based (slower, allocates memory)
SELECT l2_distance_arr(ARRAY[1,2,3]::float4[], ARRAY[4,5,6]::float4[])
FROM items;

New (Zero-Copy) Style - RECOMMENDED

-- RuVector-based (faster, zero-copy)
SELECT embedding <-> '[1,2,3]'::ruvector
FROM items;

Performance Comparison

Benchmark (1024-dimensional vectors, 10k queries):

Implementation	Time (ms)	Memory Allocations
Array-based	245	20,000
Zero-copy RuVector	87	0
Speedup	2.8x	∞

Error Handling

Dimension Mismatch

-- This will error
SELECT '[1,2,3]'::ruvector <-> '[1,2]'::ruvector;
-- ERROR: Cannot compute distance between vectors of different dimensions (3 vs 2)

NULL Handling

-- NULL propagates correctly
SELECT NULL::ruvector <-> '[1,2,3]'::ruvector;
-- Returns: NULL

Zero Vectors

-- Cosine distance handles zero vectors gracefully
SELECT '[0,0,0]'::ruvector <=> '[0,0,0]'::ruvector;
-- Returns: 1.0 (maximum distance)

Best Practices

Use operators instead of functions for cleaner SQL and better index support
Create appropriate indexes for large-scale similarity search
Normalize vectors for cosine distance when using other metrics
Monitor SIMD usage with ruvector_simd_info() for performance tuning
Batch queries when possible to amortize setup costs

Compatibility

pgrx version: 0.12.x
PostgreSQL: 12, 13, 14, 15, 16
Platforms: x86_64 (AVX-512, AVX2), ARM AArch64 (NEON)
pgvector compatibility: SQL operators match pgvector syntax

7.8 KiB Raw Blame History