Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
386
docs/hnsw/HNSW_INDEX.md
Normal file
@@ -0,0 +1,386 @@
# HNSW Index Implementation

## Overview

This document describes the HNSW (Hierarchical Navigable Small World) index implementation as a PostgreSQL Access Method for the RuVector extension.

## What is HNSW?

HNSW is a graph-based algorithm for approximate nearest neighbor (ANN) search in high-dimensional spaces. It provides:

- **Logarithmic search complexity**: O(log N) average case
- **High recall**: >95% recall achievable with proper parameters
- **Incremental updates**: Supports efficient insertions and deletions
- **Multi-layer graph structure**: Hierarchical organization for fast traversal
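The multi-layer structure comes from how each node is assigned a top layer at insert time. The following is a minimal Python sketch of the level-assignment rule described in the HNSW paper. Whether RuVector uses exactly this rule is an assumption, and the names `random_level` and `m_l` are illustrative only.

```python
import math
import random


def random_level(m: int, rng: random.Random) -> int:
    """Draw a node's top layer as floor(-ln(U) * mL), with mL = 1 / ln(m)."""
    m_l = 1.0 / math.log(m)   # normalization factor suggested in the HNSW paper
    u = 1.0 - rng.random()    # uniform in (0, 1], so log(u) is always defined
    return int(-math.log(u) * m_l)


# Assumed example: sample levels for 100,000 nodes with m = 16.
rng = random.Random(42)
levels = [random_level(16, rng) for _ in range(100_000)]

# Fraction of nodes that exist on layer 0 only.
share_layer0_only = levels.count(0) / len(levels)
```

Under this rule the probability of reaching layer k or higher is m^(-k), so with `m = 16` roughly 15 of every 16 nodes live on layer 0 only. The upper layers stay sparse, which is what keeps the top-down traversal cheap.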

## Architecture

### Page-Based Storage

The HNSW index stores data in PostgreSQL pages for durability and memory management:

```
Page 0 (Metadata):
├─ Magic number: 0x484E5357 ("HNSW")
├─ Version: 1
├─ Dimensions: Vector dimensionality
├─ Parameters: m, m0, ef_construction
├─ Entry point: Block number of top-level node
├─ Max layer: Highest layer in the graph
└─ Metric: Distance metric (L2/Cosine/IP)

Page 1+ (Node Pages):
├─ Node Header:
│  ├─ Page type: HNSW_PAGE_NODE
│  ├─ Max layer: Highest layer for this node
│  └─ Item pointer: TID of heap tuple
├─ Vector data: [f32; dimensions]
├─ Layer 0 neighbors: [BlockNumber; m0]
└─ Layer 1+ neighbors: [[BlockNumber; m]; max_layer]
```
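To make the layout concrete, here is a hedged Python sketch that serializes the metadata fields and sizes a node's payload. The field order, the 32-bit big-endian encoding, the numeric metric codes, and every function name are assumptions for illustration. Only the magic number, the version, and the field list come from the layout above.

```python
import struct

HNSW_MAGIC = 0x484E5357  # spells "HNSW" when written big-endian
HNSW_VERSION = 1
METRIC_CODES = {"l2": 0, "cosine": 1, "ip": 2}  # hypothetical numeric codes

# Hypothetical encoding: nine big-endian u32 fields in the order
# magic, version, dimensions, m, m0, ef_construction, entry point, max layer, metric.
META_STRUCT = struct.Struct(">9I")
META_KEYS = ("magic", "version", "dimensions", "m", "m0",
             "ef_construction", "entry_block", "max_layer", "metric")


def pack_meta(dimensions, m, m0, ef_construction, entry_block, max_layer, metric):
    """Serialize the metadata fields into bytes for page 0."""
    return META_STRUCT.pack(HNSW_MAGIC, HNSW_VERSION, dimensions, m, m0,
                            ef_construction, entry_block, max_layer,
                            METRIC_CODES[metric])


def unpack_meta(buf):
    """Parse metadata bytes, rejecting buffers that lack the magic number."""
    fields = META_STRUCT.unpack(buf[:META_STRUCT.size])
    if fields[0] != HNSW_MAGIC:
        raise ValueError("not an HNSW metadata page")
    return dict(zip(META_KEYS, fields))


def node_payload_bytes(dimensions, m, m0, upper_layers):
    """Bytes for one node's vector and neighbor arrays, header excluded.

    f32 components take 4 bytes; PostgreSQL's BlockNumber is a 32-bit integer.
    """
    return dimensions * 4 + m0 * 4 + upper_layers * m * 4


# Assumed example values; m0 = 32 is not stated in the layout above.
meta = pack_meta(dimensions=128, m=16, m0=32, ef_construction=64,
                 entry_block=1, max_layer=4, metric="l2")
```

For a 128-dimensional node that reaches four upper layers, `node_payload_bytes(128, 16, 32, 4)` comes to 896 bytes, so several such nodes fit in a default 8 kB PostgreSQL page.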

### Access Method Callbacks

The implementation provides all required PostgreSQL index AM callbacks:

1. **`ambuild`** - Builds index from table data
2. **`ambuildempty`** - Creates empty index structure
3. **`aminsert`** - Inserts a single vector
4. **`ambulkdelete`** - Bulk deletion support
5. **`amvacuumcleanup`** - Vacuum cleanup operations
6. **`amcostestimate`** - Query cost estimation
7. **`amgettuple`** - Sequential tuple retrieval
8. **`amgetbitmap`** - Bitmap scan support
9. **`amcanreturn`** - Index-only scan capability
10. **`amoptions`** - Index option parsing

## Usage

### Creating an HNSW Index

```sql
-- Basic index creation (L2 distance, default parameters)
CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops);

-- With custom parameters
CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops)
    WITH (m = 32, ef_construction = 128);

-- Cosine distance
CREATE INDEX ON items USING hnsw (embedding hnsw_cosine_ops);

-- Inner product
CREATE INDEX ON items USING hnsw (embedding hnsw_ip_ops);
```

### Querying

```sql
-- Find 10 nearest neighbors using L2 distance
SELECT id, embedding <-> ARRAY[0.1, 0.2, 0.3]::real[] AS distance
FROM items
ORDER BY embedding <-> ARRAY[0.1, 0.2, 0.3]::real[]
LIMIT 10;

-- Find 10 nearest neighbors using cosine distance
SELECT id, embedding <=> ARRAY[0.1, 0.2, 0.3]::real[] AS distance
FROM items
ORDER BY embedding <=> ARRAY[0.1, 0.2, 0.3]::real[]
LIMIT 10;

-- Find vectors with largest inner product
SELECT id, embedding <#> ARRAY[0.1, 0.2, 0.3]::real[] AS neg_ip
FROM items
ORDER BY embedding <#> ARRAY[0.1, 0.2, 0.3]::real[]
LIMIT 10;
```

## Parameters

### Index Build Parameters

| Parameter | Type | Default | Range | Description |
|-----------|------|---------|-------|-------------|
| `m` | integer | 16 | 2-128 | Maximum connections per layer |
| `ef_construction` | integer | 64 | 4-1000 | Size of dynamic candidate list during build |
| `metric` | string | 'l2' | l2/cosine/ip | Distance metric |

**Parameter Tuning Guidelines:**

- **`m`**: Higher values improve recall but increase memory usage
  - Low (8-16): Fast build, lower memory, good for small datasets
  - Medium (16-32): Balanced performance
  - High (32-64): Better recall, slower build, more memory

- **`ef_construction`**: Higher values improve index quality but slow down build
  - Low (32-64): Fast build, may sacrifice recall
  - Medium (64-128): Balanced
  - High (128-500): Best quality, slow build
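When index DDL is generated from application code, the documented ranges can be checked before the statement is sent to the server. The helper below is a hypothetical sketch: `hnsw_with_clause` is not part of RuVector, and it only encodes the defaults and ranges from the table above.

```python
def hnsw_with_clause(m: int = 16, ef_construction: int = 64) -> str:
    """Validate build parameters against the documented ranges and
    return the WITH (...) clause for CREATE INDEX."""
    if not 2 <= m <= 128:
        raise ValueError(f"m must be in 2-128, got {m}")
    if not 4 <= ef_construction <= 1000:
        raise ValueError(
            f"ef_construction must be in 4-1000, got {ef_construction}")
    return f"WITH (m = {m}, ef_construction = {ef_construction})"


# Assumed example: the "High" tier for m with a medium-to-high ef_construction.
clause = hnsw_with_clause(m=32, ef_construction=128)
statement = f"CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops) {clause};"
```

Failing early in the client gives a clearer error than a rejected `CREATE INDEX`, and keeps the tuning tiers above in one place.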

### Query-Time Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `ruvector.ef_search` | integer | 40 | Size of dynamic candidate list during search |

**Setting ef_search:**

```sql
-- Global setting (ALTER SYSTEM writes postgresql.auto.conf; reload to apply)
ALTER SYSTEM SET ruvector.ef_search = 100;
SELECT pg_reload_conf();

-- Session setting (per-connection)
SET ruvector.ef_search = 100;

-- Query with increased recall (SET LOCAL only takes effect inside a transaction)
BEGIN;
SET LOCAL ruvector.ef_search = 200;
SELECT ... ORDER BY embedding <-> query LIMIT 10;
COMMIT;
```
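To see why `ef_search` trades speed for recall, here is a minimal Python sketch of the best-first search that HNSW runs on a single layer, following the algorithm in the HNSW paper. It is not RuVector's code: the `search_layer` name, the in-memory dictionaries, and the L2 metric are assumptions for illustration.

```python
import heapq
import math


def search_layer(vectors, neighbors, query, entry, ef):
    """Return up to `ef` (distance, node_id) pairs closest to `query`,
    found by best-first traversal starting from `entry`."""
    def dist(node):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vectors[node], query)))

    visited = {entry}
    d0 = dist(entry)
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    results = [(-d0, entry)]     # max-heap (negated): worst kept result on top

    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break                # nothing left can improve the result set
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)   # drop the current worst

    return sorted((-d, node) for d, node in results)


# Assumed toy graph: ten points on a line, each linked to its neighbors.
vectors = {i: [float(i)] for i in range(10)}
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
hits = search_layer(vectors, neighbors, [7.2], entry=0, ef=3)
nearest_ids = [node for _, node in hits]
```

The result heap never holds more than `ef` entries, so a larger `ef` keeps more candidates alive and explores more of the graph before the stop condition fires. That is the source of both the better recall and the extra distance computations.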

## Distance Metrics

### L2 (Euclidean) Distance

- **Operator**: `<->`
- **Formula**: `√(Σ(a[i] - b[i])²)`
- **Use case**: General-purpose distance
- **Range**: [0, ∞)

```sql
CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops);
SELECT * FROM items ORDER BY embedding <-> query_vector LIMIT 10;
```

### Cosine Distance

- **Operator**: `<=>`
- **Formula**: `1 - (a·b)/(||a||·||b||)`
- **Use case**: Direction similarity (text embeddings)
- **Range**: [0, 2]

```sql
CREATE INDEX ON items USING hnsw (embedding hnsw_cosine_ops);
SELECT * FROM items ORDER BY embedding <=> query_vector LIMIT 10;
```

### Inner Product

- **Operator**: `<#>`
- **Formula**: `-Σ(a[i] * b[i])`
- **Use case**: Maximum similarity (normalized vectors)
- **Range**: (-∞, ∞)

```sql
CREATE INDEX ON items USING hnsw (embedding hnsw_ip_ops);
SELECT * FROM items ORDER BY embedding <#> query_vector LIMIT 10;
```
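The three formulas can be checked by hand with a few lines of Python. This is a reference sketch of the math only; the function names are illustrative and are not the extension's SQL functions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: sqrt(sum((a[i] - b[i])^2))."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - (a.b) / (||a|| * ||b||); undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def neg_inner_product(a, b):
    """Negated inner product, so that ascending order means most similar first."""
    return -sum(x * y for x, y in zip(a, b))
```

The negation in the inner product is what lets `ORDER BY ... <#> ...` return the largest inner products first, since PostgreSQL orders by ascending distance.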

## Performance

### Build Performance

- **Time Complexity**: O(N log N) with high probability
- **Space Complexity**: O(N * M * L) where L is average layer count
- **Typical Build Rate**: 1000-10000 vectors/sec (depends on dimensions)

### Query Performance

- **Time Complexity**: O(ef_search * log N)
- **Typical Query Time**:
  - <1ms for 100K vectors (128D)
  - <5ms for 1M vectors (128D)
  - <10ms for 10M vectors (128D)

### Memory Usage

```
Memory per vector ≈ dimensions * 4 bytes + m * 8 bytes * average_layers
Average layers ≈ log₂(N) / log₂(m)

Example (1M vectors, 128D, m=16):
- Average layers: log₂(1M) / log₂(16) ≈ 20 / 4 = 5
- Vector data: 1M * 128 * 4 = 512 MB
- Graph edges: 1M * 16 * 8 * 5 = 640 MB
- Total: ~1.15 GB
```
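The estimate is easy to script for capacity planning. The sketch below implements the formula exactly as written above. The function name and the `bytes_per_edge` parameter are assumptions. Note that log₂(N) / log₂(m) is closer to the height of the whole graph than to the number of layers a typical node occupies, so the result is best read as a conservative figure.

```python
import math


def estimate_index_bytes(n, dimensions, m, bytes_per_edge=8):
    """Apply the memory formula above.

    Returns (vector_bytes, graph_bytes, average_layers).
    """
    average_layers = math.log2(n) / math.log2(m)
    vector_bytes = n * dimensions * 4                    # f32 components
    graph_bytes = n * m * bytes_per_edge * average_layers
    return vector_bytes, graph_bytes, average_layers


# The example from the text: 1M vectors, 128 dimensions, m = 16.
vector_bytes, graph_bytes, average_layers = estimate_index_bytes(1_000_000, 128, 16)
total_gb = (vector_bytes + graph_bytes) / 1e9
```

For this input the vector data comes to exactly 512 MB and the layer term evaluates to just under 5, which puts the total a little above 1.1 GB.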

## Operator Classes

### hnsw_l2_ops

For L2 (Euclidean) distance on `real[]` vectors.

```sql
CREATE OPERATOR CLASS hnsw_l2_ops
    FOR TYPE real[] USING hnsw
    FAMILY hnsw_l2_ops AS
    OPERATOR 1 <-> (real[], real[]) FOR ORDER BY float_ops,
    FUNCTION 1 l2_distance_arr(real[], real[]);
```

### hnsw_cosine_ops

For cosine distance on `real[]` vectors.

```sql
CREATE OPERATOR CLASS hnsw_cosine_ops
    FOR TYPE real[] USING hnsw
    FAMILY hnsw_cosine_ops AS
    OPERATOR 1 <=> (real[], real[]) FOR ORDER BY float_ops,
    FUNCTION 1 cosine_distance_arr(real[], real[]);
```

### hnsw_ip_ops

For inner product on `real[]` vectors.

```sql
CREATE OPERATOR CLASS hnsw_ip_ops
    FOR TYPE real[] USING hnsw
    FAMILY hnsw_ip_ops AS
    OPERATOR 1 <#> (real[], real[]) FOR ORDER BY float_ops,
    FUNCTION 1 neg_inner_product_arr(real[], real[]);
```

## Monitoring and Maintenance

### Index Statistics

```sql
-- View memory usage
SELECT ruvector_memory_stats();

-- Check index size
SELECT pg_size_pretty(pg_relation_size('items_embedding_idx'));

-- View index definition
SELECT indexdef FROM pg_indexes WHERE indexname = 'items_embedding_idx';
```

### Index Maintenance

```sql
-- Perform maintenance (optimize connections, rebuild degraded nodes)
SELECT ruvector_index_maintenance('items_embedding_idx');

-- Vacuum to reclaim space after deletes
VACUUM items;

-- Rebuild index if heavily modified
REINDEX INDEX items_embedding_idx;
```

### Query Plan Analysis

```sql
-- Analyze query execution
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, embedding <-> query AS distance
FROM items
ORDER BY embedding <-> query
LIMIT 10;
```

## Best Practices

### 1. Index Creation

- Build indexes on stable data when possible
- Use higher `ef_construction` for better quality
- Consider raising `maintenance_work_mem` for large builds:
  ```sql
  SET maintenance_work_mem = '2GB';
  CREATE INDEX ...;
  ```

### 2. Query Optimization

- Adjust `ef_search` based on recall requirements
- Use prepared statements for repeated queries
- Consider query result caching for common queries

### 3. Data Management

- Normalize vectors for cosine similarity
- Batch inserts when possible
- Schedule index maintenance during low-traffic periods
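Normalization can be done in the application before vectors are inserted. The following is a minimal Python sketch, assuming vectors are handled as plain lists of floats on the client side; the `normalize` helper is illustrative and not part of RuVector.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm; zero vectors cannot be normalized."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])

dot = sum(x * y for x, y in zip(a, b))
cosine_distance = 1.0 - dot   # both norms are 1, so no division is needed
neg_inner_product = -dot      # differs from cosine distance only by a constant
```

Once every vector has unit length, cosine distance and negated inner product differ only by the constant 1, so both produce the same neighbor ordering.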

### 4. Monitoring

- Track index size growth
- Monitor query performance metrics
- Set up alerts for memory usage

## Limitations

### Current Version

- **Single column only**: Multi-column indexes not supported
- **No parallel scans**: Query parallelism not yet implemented
- **No index-only scans**: Must access heap tuples
- **Array type only**: Custom vector type support coming soon

### PostgreSQL Version Requirements

- PostgreSQL 14+
- pgrx 0.12+

## Troubleshooting

### Index Build Fails

**Problem**: Out of memory during index build
**Solution**: Increase `maintenance_work_mem` or reduce `ef_construction`

```sql
SET maintenance_work_mem = '4GB';
```

### Slow Queries

**Problem**: Queries are slower than expected
**Solution**: Lower `ef_search` (query time grows with it, so this costs some recall) and confirm with `EXPLAIN` that the HNSW index is actually used. If recall then drops too far, rebuilding with a higher `m` may recover it.

```sql
SET ruvector.ef_search = 20;
```

### Low Recall

**Problem**: Not finding correct nearest neighbors
**Solution**: Increase `ef_search`, or rebuild with a higher `ef_construction`. `REINDEX` rebuilds with the index's stored parameters, so changing `ef_construction` means dropping the index and recreating it with a new `WITH` clause.

```sql
SET ruvector.ef_search = 200;       -- try this first; no rebuild needed
REINDEX INDEX items_embedding_idx;  -- rebuilds with the stored parameters
```
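Before tuning, it helps to measure recall rather than guess at it. The sketch below compares the ids returned by an approximate search against an exact brute-force answer. The function names and the in-memory dictionary are assumptions for illustration; in practice the exact ids could equally come from a sequential scan in PostgreSQL.

```python
import math


def brute_force_knn(vectors, query, k):
    """Exact k nearest ids by L2 distance; `vectors` maps id -> list of floats."""
    def l2(v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(v, query)))
    return sorted(vectors, key=lambda i: l2(vectors[i]))[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate search returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Assumed toy data: three vectors and one query.
vectors = {1: [0.0, 0.0], 2: [1.0, 0.0], 3: [5.0, 5.0]}
exact = brute_force_knn(vectors, [0.1, 0.0], k=2)

# Pretend the index returned ids 1 and 3 for the same query.
recall = recall_at_k([1, 3], exact)
```

Averaging `recall_at_k` over a sample of real queries at several `ef_search` values shows where extra search effort stops paying off.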

## Comparison with Other Methods

| Feature | HNSW | IVFFlat | Brute Force |
|---------|------|---------|-------------|
| Search Time | O(log N) | O(√N) | O(N) |
| Build Time | O(N log N) | O(N) | O(1) |
| Memory | High | Medium | Low |
| Recall | >95% | >90% | 100% |
| Updates | Good | Poor | Excellent |

## Future Enhancements

- [ ] Parallel index scans
- [ ] Custom vector type support
- [ ] Index-only scans
- [ ] Dynamic parameter tuning
- [ ] Graph compression
- [ ] Multi-column indexes
- [ ] Distributed HNSW

## References

1. Malkov, Y. A., & Yashunin, D. A. (2018). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." IEEE Transactions on Pattern Analysis and Machine Intelligence.

2. PostgreSQL Index Access Method documentation: https://www.postgresql.org/docs/current/indexam.html

3. pgrx documentation: https://github.com/pgcentralfoundation/pgrx

## License

MIT License - See LICENSE file for details.