wifi-densepose/docs/research/RUVECTOR_PGLITE_CRITICAL_GAPS.md
ruv d803bfe2b1 Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00

# Critical Gaps in RuVector-PGlite Implementation Plan
## 🚨 Major Architectural Flaws Discovered
Research into how PGlite extensions are actually built shows that the original implementation plan has **critical flaws** that must be addressed.
---
## ❌ What Was WRONG in the Original Plan
### 1. **pgrx Does NOT Support WASM Compilation**
**Original Assumption**: Use pgrx with `wasm32-unknown-unknown` target
```toml
# ❌ THIS DOESN'T WORK
[lib]
crate-type = ["cdylib"]
[target.wasm32-unknown-unknown]
# pgrx is not designed for WASM target
```
**Reality**:
- pgrx is designed to build **native** PostgreSQL extensions (.so, .dylib, .dll)
- pgrx is used to build extensions that **run** WebAssembly (via Extism), not extensions **compiled to** WebAssembly
- No evidence of pgrx supporting wasm32 as a compilation target
**Sources**:
- [pgrx WebAssembly research](https://dylibso.com/blog/pg-extism/)
- No wasm32 target in pgrx documentation
### 2. **Wrong Build Toolchain**
**Original Plan**: `cargo pgrx package --target wasm32`
**Reality**: PGlite extensions require:
```bash
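# ✅ CORRECT: Emscripten toolchain
# (the extension is loaded dynamically, so it is built as a side module;
#  PGlite's Postgres binary is the main module)
emcc -o extension.wasm extension.c \
  -I$POSTGRES_INCLUDE \
  -s SIDE_MODULE=1 \
  -s ASYNCIFY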
```
**Required Tools**:
- ✅ Emscripten SDK (emsdk)
- ✅ PostgreSQL headers for WASM
- ✅ Tar packaging for `.tar.gz` bundles
- ❌ NOT cargo pgrx
### 3. **Misunderstood Extension Structure**
**Original Plan**: Build a standalone `.wasm` file
**Reality**: PGlite extensions are `.tar.gz` tarballs containing:
```
vector.tar.gz
└── extension/
    ├── vector.so.wasm   # WASM compiled extension
    ├── vector.control   # Extension metadata
    ├── vector--*.sql    # SQL install scripts
    └── data/            # Any data files
```
**Actual pgvector Implementation**:
```typescript
// packages/pglite/src/vector/index.ts
const setup = async (_pg: PGliteInterface, emscriptenOpts: any) => {
  return {
    emscriptenOpts,
    bundlePath: new URL('../../release/vector.tar.gz', import.meta.url),
  }
}

export const vector = { name: 'pgvector', setup }
```
**Source**: [PGlite vector extension source](https://github.com/electric-sql/pglite/blob/main/packages/pglite/src/vector/index.ts)
### 4. **Missing Build Process Details**
**What Was Missing**:
- How to clone PGlite with submodules
- How to add extension to `postgres-pglite/pglite/Makefile`
- How to build within PGlite's build system
- Emscripten compilation flags (`SIDE_MODULE` for the extension itself, `ASYNCIFY`)
- Tarball packaging steps
**Actual Process** ([source](https://github.com/electric-sql/pglite/blob/main/docs/extensions/development.md)):
```bash
# 1. Clone PGlite
git clone --recurse-submodules git@github.com:electric-sql/pglite.git
cd pglite && pnpm i
# 2. Add extension as submodule
cd postgres-pglite/pglite
git submodule add <extension_url>
# 3. Register in Makefile
echo "SUBDIRS += ruvector" >> Makefile
# 4. Build (creates .tar.gz)
pnpm build:all
# Output: packages/pglite/release/ruvector.tar.gz
```
### 5. **No Rust-Specific Guidance**
**Gap**: How to write a Rust extension that compiles with Emscripten?
**Missing Details**:
- Rust → C FFI interface layer
- `#[no_mangle]` exports for PostgreSQL API
- Memory management (Emscripten vs Rust allocator)
- Build script for `emcc` + `rustc`
**Possible Approaches**:
#### Option A: Pure C Extension
```c
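// ruvector_pglite.c
#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"

PG_MODULE_MAGIC;
/* Implemented in Rust and linked in from libruvector_core.a */
extern float rust_cosine_distance(const float *a, const float *b, size_t len);
PG_FUNCTION_INFO_V1(vector_cosine_distance);
Datum vector_cosine_distance(PG_FUNCTION_ARGS) {
    /* Sketch: assumes two NULL-free float4[] arguments; a real vector type would replace this */
    ArrayType *a = PG_GETARG_ARRAYTYPE_P(0);
    ArrayType *b = PG_GETARG_ARRAYTYPE_P(1);
    int len = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));
    if (len != ArrayGetNItems(ARR_NDIM(b), ARR_DIMS(b)))
        ereport(ERROR, (errmsg("vectors must have the same number of elements")));
    /* Call Rust library via FFI */
    PG_RETURN_FLOAT4(rust_cosine_distance((float4 *) ARR_DATA_PTR(a),
                                          (float4 *) ARR_DATA_PTR(b), len));
}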
```
Then compile:
```bash
emcc -o ruvector.wasm ruvector_pglite.c libruvector_core.a \
  -I$PG_INCLUDE -s SIDE_MODULE=1
```
#### Option B: Rust with C Wrapper
```rust
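// ruvector_core/src/ffi.rs
use std::slice;

/// # Safety
/// `a` and `b` must each point to `len` readable `f32` values.
#[no_mangle]
pub unsafe extern "C" fn rust_cosine_distance(
    a: *const f32,
    b: *const f32,
    len: usize,
) -> f32 {
    // Safe Rust implementation over borrowed slices
    let (a, b) = unsafe { (slice::from_raw_parts(a, len), slice::from_raw_parts(b, len)) };
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norms = a.iter().map(|x| x * x).sum::<f32>().sqrt()
        * b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / norms // cosine distance; NaN if either vector has zero norm
}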
```
Then build:
```bash
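# Build Rust to WASM staticlib (needs crate-type = ["staticlib"] in Cargo.toml)
cargo build --target wasm32-unknown-emscripten --release
# Link with C wrapper as a side module
emcc -o ruvector.wasm wrapper.c \
  target/wasm32-unknown-emscripten/release/libruvector_core.a \
  -I$PG_INCLUDE -s SIDE_MODULE=1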
```
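The memory-management gap listed above (Emscripten vs Rust allocator) is usually handled by making sure that whichever side allocates a buffer also frees it. The following is a minimal sketch of that convention, not part of the existing ruvector code: the function names `ruvector_buf_new` and `ruvector_buf_free` are hypothetical, and the C wrapper is assumed to call them instead of `malloc`/`free` or `palloc`/`pfree` for any buffer that Rust owns.

```rust
use std::ptr;

/// Allocates a zero-filled buffer of `len` f32 values owned by Rust and
/// hands a raw pointer to the C side. (Hypothetical helper name.)
/// Note: `#[no_mangle]` is the edition 2021 spelling; edition 2024
/// writes it as `#[unsafe(no_mangle)]`.
#[no_mangle]
pub extern "C" fn ruvector_buf_new(len: usize) -> *mut f32 {
    let boxed: Box<[f32]> = vec![0.0f32; len].into_boxed_slice();
    // Leak the box so the allocation outlives this call; the caller
    // must return it through `ruvector_buf_free`.
    Box::into_raw(boxed) as *mut f32
}

/// Releases a buffer obtained from `ruvector_buf_new`.
///
/// # Safety
/// `ptr` must come from `ruvector_buf_new` called with the same `len`,
/// and must not be used after this call. A null pointer is ignored.
#[no_mangle]
pub unsafe extern "C" fn ruvector_buf_free(ptr: *mut f32, len: usize) {
    if ptr.is_null() {
        return;
    }
    // Rebuild the boxed slice so Rust's allocator frees it.
    unsafe {
        drop(Box::from_raw(ptr::slice_from_raw_parts_mut(ptr, len)));
    }
}
```

Passing the length back on free keeps the sketch simple; a real FFI layer might instead return an opaque handle that remembers its own size.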
### 6. **Size Targets May Be Unrealistic**
**Original Target**: 500KB-1MB WASM
**Reality Check**:
- pgvector (minimal, C-based): ~200KB compiled to WASM
- Full ruvector features (even stripped): likely 2-5MB
- Rust std library adds ~100-300KB
- PostgreSQL runtime overhead: varies
**Revised Targets**:
- Minimal (types + distances): ~500KB-1MB ✅
- With HNSW index: ~1-2MB
- With quantization: ~2-3MB
- Full features: 5-10MB (defeats purpose)
### 7. **No TypeScript Plugin API Consideration**
**Missing Alternative**: PGlite's custom plugin API
Instead of a PostgreSQL extension, could build a **TypeScript plugin** that provides vector operations via PGlite's namespace API:
```typescript
// Hybrid approach: TypeScript + WASM compute kernel
import type { Extension, PGliteInterface } from '@electric-sql/pglite'
// wasm-pack emits a JS glue module next to the .wasm file; import that module.
// The path and the `cosineDistance` export name are assumptions about the build.
import init, { cosineDistance } from './pkg/ruvector_core.js'

const setup = async (_pg: PGliteInterface) => {
  await init() // Initialize WASM
  return {
    namespaceObj: {
      vector: {
        cosineDistance: (a: Float32Array, b: Float32Array) =>
          cosineDistance(a, b), // WASM function
        // Other vector operations...
      },
    },
  }
}

export const ruvector = { name: 'ruvector', setup } satisfies Extension
```
**Usage**:
```typescript
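import { PGlite } from '@electric-sql/pglite'
import { ruvector } from './ruvector' // hypothetical path to the plugin module

const db = await PGlite.create({ extensions: { ruvector } })
// Use via JavaScript API (not SQL)
const dist = db.ruvector.vector.cosineDistance(
  new Float32Array([1, 0]), new Float32Array([0, 1]))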
```
**Pros**:
- ✅ No Emscripten/PostgreSQL build complexity
- ✅ Direct WASM (no PostgreSQL FFI overhead)
- ✅ Easier to build and maintain
- ✅ Can still use Rust → wasm-bindgen
**Cons**:
- ❌ Not SQL-compatible (no `SELECT ... ORDER BY embedding <=> $1`)
- ❌ Can't use PostgreSQL indexes
- ❌ Not a drop-in pgvector replacement
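Because a plugin of this kind cannot use PostgreSQL indexes, nearest-neighbour search would have to happen inside the WASM kernel itself. The sketch below shows one way that kernel could look as a brute-force scan over a flat buffer. It is an illustration under assumptions, not ruvector's actual API: `top_k_l2` is a hypothetical name, the data layout (row-major `f32`) is assumed, and the function is plain Rust that would still need a `wasm-bindgen` export wrapper to be callable from TypeScript.

```rust
/// Returns `(row index, squared L2 distance)` for the `k` rows of `data`
/// closest to `query`. `data` is a flat, row-major buffer holding rows
/// of `dim` values each. (Hypothetical kernel function.)
pub fn top_k_l2(data: &[f32], dim: usize, query: &[f32], k: usize) -> Vec<(usize, f32)> {
    assert!(dim > 0 && query.len() == dim && data.len() % dim == 0);

    // Score every row: this is the flat scan, O(rows * dim).
    let mut scored: Vec<(usize, f32)> = data
        .chunks_exact(dim)
        .enumerate()
        .map(|(i, row)| {
            let d: f32 = row.iter().zip(query).map(|(x, q)| (x - q) * (x - q)).sum();
            (i, d)
        })
        .collect();

    // `total_cmp` is a total order on f32, so sorting never panics on NaN.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}
```

Squared distances are enough for ranking, which avoids a square root per row. For large collections a bounded heap would avoid sorting every score, but a full sort keeps the sketch short.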
---
## ✅ What's ACTUALLY Needed
### Corrected Architecture Options
#### **Option 1: Full PostgreSQL Extension (Complex but SQL-compatible)**
```
┌─────────────────────────────────────────┐
│  ruvector-core (Rust library)           │
│  - Vector types, distances, HNSW        │
│  - Compiles to: libruvector_core.a      │
│  - Target: wasm32-unknown-emscripten    │
└─────────────────────────────────────────┘
              │ C FFI
┌─────────────┴───────────────────────────┐
│  ruvector_pglite_wrapper.c              │
│  - PostgreSQL extension entry points    │
│  - PG_FUNCTION_INFO_V1 macros           │
│  - Calls Rust via FFI                   │
└─────────────────────────────────────────┘
              │ Emscripten
┌─────────────────────────────────────────┐
│  ruvector.tar.gz                        │
│  ├── ruvector.so.wasm                   │
│  ├── ruvector.control                   │
│  └── ruvector--0.1.0.sql                │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│  @ruvector/pglite (TypeScript)          │
│  - Extension loader                     │
│  - Minimal wrapper (like pgvector)      │
└─────────────────────────────────────────┘
```
**Build Process**:
1. Fork PGlite repo
2. Add ruvector as submodule in `postgres-pglite/pglite/`
3. Create Makefile with Emscripten rules
4. Build Rust core to WASM staticlib
5. Link with C wrapper
6. Package to .tar.gz
7. Create TypeScript loader
**Pros**: ✅ Full SQL compatibility, ✅ PostgreSQL indexes
**Cons**: ❌ Complex build, ❌ Large size, ❌ Tight coupling to PGlite
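One detail the C FFI arrow in this architecture hides is error reporting: a Rust panic must not unwind into the C wrapper, and PostgreSQL expects errors to be raised on the C side via `ereport`. A common pattern is for the Rust entry point to validate its inputs and return a status code that the wrapper translates. The sketch below illustrates that pattern only; the function name `ruvector_cosine_distance_checked` and the status constants are hypothetical and not taken from ruvector.

```rust
use std::slice;

// Status codes for the C wrapper to map onto ereport() calls (hypothetical values).
pub const RUVECTOR_OK: i32 = 0;
pub const RUVECTOR_ERR_NULL: i32 = 1;
pub const RUVECTOR_ERR_ZERO_NORM: i32 = 2;

/// Safe core: cosine distance, or `None` when either vector has zero norm.
pub fn cosine_distance(a: &[f32], b: &[f32]) -> Option<f32> {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        None
    } else {
        Some(1.0 - dot / (na * nb))
    }
}

/// FFI entry point: writes the result to `out` and returns a status code
/// instead of panicking.
///
/// # Safety
/// Non-null `a` and `b` must point to `len` readable f32 values, and a
/// non-null `out` must be writable.
#[no_mangle]
pub unsafe extern "C" fn ruvector_cosine_distance_checked(
    a: *const f32,
    b: *const f32,
    len: usize,
    out: *mut f32,
) -> i32 {
    if a.is_null() || b.is_null() || out.is_null() {
        return RUVECTOR_ERR_NULL;
    }
    let (a, b) = unsafe { (slice::from_raw_parts(a, len), slice::from_raw_parts(b, len)) };
    match cosine_distance(a, b) {
        Some(d) => {
            unsafe { *out = d };
            RUVECTOR_OK
        }
        None => RUVECTOR_ERR_ZERO_NORM,
    }
}
```

Keeping the arithmetic in a safe function and the pointer handling in a thin `extern "C"` shim also makes the core testable without PostgreSQL or Emscripten.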
#### **Option 2: Hybrid TypeScript Plugin (Simpler, WASM-native)**
```
┌─────────────────────────────────────────┐
│  ruvector-core (Rust library)           │
│  - Vector operations only               │
│  - No PostgreSQL dependencies           │
│  - wasm-bindgen for JS interop          │
│  - Target: wasm32-unknown-unknown       │
└─────────────────────────────────────────┘
              │ wasm-pack
┌─────────────────────────────────────────┐
│  ruvector_core_bg.wasm + .js glue       │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│  @ruvector/pglite (TypeScript plugin)   │
│  - PGlite Extension interface           │
│  - Namespace API for vector ops         │
│  - SQL function wrappers (via exec)     │
└─────────────────────────────────────────┘
```
**Build Process**:
```bash
# 1. Build Rust to WASM
cd ruvector-core
wasm-pack build --target web
# 2. Create TypeScript wrapper
cd ../npm/packages/pglite
pnpm build
# 3. Publish
pnpm publish
```
**Pros**: ✅ Simple build, ✅ Small size, ✅ Easy maintenance
**Cons**: ❌ Limited SQL integration, ❌ No native indexes
#### **Option 3: Minimal SQL Extension (Best Balance)**
Start with **Option 1** but with **minimal features**:
- ✅ Core vector type
- ✅ Distance operators (`<->`, `<=>`, `<#>`)
- ❌ Skip HNSW (use flat scan)
- ❌ Skip quantization
- ❌ Skip advanced features
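**Target Size**: ~500KB-1MB, matching the minimal tier of the revised size targets (pgvector itself is ~200KB, but the Rust standard library adds ~100-300KB on top)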
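The three operators in this minimal feature set follow pgvector's conventions: `<->` is Euclidean (L2) distance, `<=>` is cosine distance, and `<#>` is the negative inner product. A minimal sketch of the corresponding Rust core is shown below. The `Metric` enum and its method are illustrative names, not existing ruvector types, and the code does no SIMD or zero-norm handling.

```rust
/// The three pgvector-style distance operators (illustrative type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Metric {
    /// `<->`: Euclidean (L2) distance.
    L2,
    /// `<=>`: cosine distance, i.e. 1 - cosine similarity.
    Cosine,
    /// `<#>`: negative inner product (smaller means more similar).
    NegInnerProduct,
}

/// Inner product of two equally sized slices.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

impl Metric {
    /// Computes the distance between `a` and `b` under this metric.
    /// Panics if the dimensions differ; cosine yields NaN for zero vectors.
    pub fn distance(self, a: &[f32], b: &[f32]) -> f32 {
        assert_eq!(a.len(), b.len(), "vectors must have equal dimensions");
        match self {
            Metric::L2 => a
                .iter()
                .zip(b)
                .map(|(x, y)| (x - y) * (x - y))
                .sum::<f32>()
                .sqrt(),
            Metric::Cosine => 1.0 - dot(a, b) / (dot(a, a).sqrt() * dot(b, b).sqrt()),
            Metric::NegInnerProduct => -dot(a, b),
        }
    }
}
```

With flat scans instead of HNSW, ordering by one of these distances is all the SQL operators need, which is what keeps this option small.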
---
## 📋 Revised Implementation Checklist
### Prerequisites
- [ ] Clone PGlite repo with submodules
- [ ] Install Emscripten SDK (emsdk)
- [ ] Study pgvector's PGlite implementation
- [ ] Understand PGlite's build system
### Development
- [ ] Create `ruvector-core` (no PostgreSQL deps)
- [ ] Add C FFI layer (`ffi.rs` with `#[no_mangle]`)
- [ ] Write C wrapper (`ruvector_wrapper.c`)
- [ ] Create extension control file (`ruvector.control`)
- [ ] Write SQL install script (`ruvector--0.1.0.sql`)
### Building
- [ ] Add as submodule to PGlite
- [ ] Configure Emscripten Makefile
- [ ] Build Rust to WASM staticlib
- [ ] Link with C wrapper using emcc
- [ ] Package to .tar.gz
### Testing
- [ ] Write Vitest tests
- [ ] Test in browser environment
- [ ] Benchmark against pgvector
- [ ] Validate SQL compatibility
### Publishing
- [ ] Create TypeScript loader
- [ ] Add to PGlite extensions catalog
- [ ] Publish to npm
- [ ] Write documentation
---
## 🎯 Recommended Path Forward
**Start with Option 2 (TypeScript Plugin)** for these reasons:
1. **Immediate Value**: Can ship in 1-2 weeks vs 6+ weeks
2. **Learning Path**: Understand PGlite before committing to Option 1
3. **Proof of Concept**: Validate demand for ruvector-pglite
4. **Simpler**: No Emscripten complexity
5. **Upgradeable**: Can migrate to Option 1 later if SQL is critical
**Then, if SQL compatibility is required**, upgrade to Option 1 (Full Extension), starting from the minimal feature set described in Option 3.
---
## 📚 Additional Research Needed
1. **Emscripten + Rust**: Best practices for compiling Rust to WASM with emcc
2. **PGlite Build System**: Deep dive into their Makefile and build scripts
3. **PostgreSQL C API**: Required functions for minimal extension
4. **Memory Management**: Emscripten's memory model vs Rust
5. **Size Optimization**: Dead code elimination, LTO for WASM
---
## Sources
- [PGlite Extension Development Guide](https://github.com/electric-sql/pglite/blob/main/docs/extensions/development.md)
- [PGlite pgvector Implementation](https://github.com/electric-sql/pglite/blob/main/packages/pglite/src/vector/index.ts)
- [In-Browser Semantic Search with PGlite](https://supabase.com/blog/in-browser-semantic-search-pglite)
- [PGlite Extensions Catalog](https://pglite.dev/extensions/)
- [Bringing WebAssembly to PostgreSQL](https://dylibso.com/blog/pg-extism/)
- [Compiling Postgres to WASM with PGlite](https://development.tldrecap.tech/posts/pgconf-2025/postgresql-pg-lite-webassembly-wasm/)
---
**Next Step**: Choose an implementation option and update the plan accordingly.