Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00
parent 7885bf6278 d803bfe2b1
commit cd5943df23
7854 changed files with 3522914 additions and 0 deletions
--- a/vendor/ruvector/docs/research/sparql/README.md
+++ b/vendor/ruvector/docs/research/sparql/README.md
@@ -0,0 +1,431 @@
+# SPARQL Research Documentation
+
+**Research Phase: Complete**
+**Date**: December 2025
+**Project**: RuVector-Postgres SPARQL Extension
+
+---
+
+## Overview
+
+This directory contains comprehensive research documentation for implementing SPARQL (SPARQL Protocol and RDF Query Language) query capabilities in the RuVector-Postgres extension. The research covers the SPARQL 1.1 specification, implementation strategies, and integration with existing vector search capabilities.
+
+---
+
+## Research Documents
+
+### 📘 [SPARQL_SPECIFICATION.md](./SPARQL_SPECIFICATION.md)
+**Complete technical specification** - 8,000+ lines
+
+Comprehensive coverage of SPARQL 1.1 including:
+- Core components (RDF triples, graph patterns, query forms)
+- Complete syntax reference (PREFIX, variables, URIs, literals, blank nodes)
+- All operations (pattern matching, FILTER, OPTIONAL, UNION, property paths)
+- Update operations (INSERT, DELETE, LOAD, CLEAR, CREATE, DROP)
+- 50+ built-in functions (string, numeric, date/time, hash, aggregates)
+- SPARQL algebra (BGP, Join, LeftJoin, Filter, Union operators)
+- Query result formats (JSON, XML, CSV, TSV)
+- PostgreSQL implementation considerations
+
+**Use this for**: Deep understanding of SPARQL semantics and formal specification.
+
+---
+
+### 🏗️ [IMPLEMENTATION_GUIDE.md](./IMPLEMENTATION_GUIDE.md)
+**Practical implementation roadmap** - 5,000+ lines
+
+Detailed implementation strategy covering:
+- Architecture overview (parser, algebra, SQL generator)
+- Data model design (triple store schema, indexes, custom types)
+- Core functions (RDF operations, namespace management)
+- Query translation (SPARQL → SQL conversion)
+- Optimization strategies (statistics, caching, materialized views)
+- RuVector integration (hybrid SPARQL + vector queries)
+- 12-week implementation roadmap
+- Testing strategy and performance targets
+
+**Use this for**: Building the SPARQL engine implementation.
+
+---
+
+### 📚 [EXAMPLES.md](./EXAMPLES.md)
+**50 practical query examples**
+
+Real-world SPARQL query examples:
+- Basic queries (SELECT, ASK, CONSTRUCT, DESCRIBE)
+- Filtering and constraints
+- Optional patterns
+- Property paths (transitive, inverse, alternative)
+- Aggregation (COUNT, SUM, AVG, GROUP BY, HAVING)
+- Update operations (INSERT, DELETE, LOAD, CLEAR)
+- Named graphs
+- Hybrid queries (SPARQL + vector similarity)
+- Advanced patterns (subqueries, VALUES, BIND, negation)
+
+**Use this for**: Learning SPARQL syntax and seeing practical applications.
+
+---
+
+### ⚡ [QUICK_REFERENCE.md](./QUICK_REFERENCE.md)
+**One-page cheat sheet**
+
+Fast reference for:
+- Query forms and basic syntax
+- Triple patterns and abbreviations
+- Graph patterns (OPTIONAL, UNION, FILTER, BIND)
+- Property path operators
+- Solution modifiers (ORDER BY, LIMIT, OFFSET)
+- All built-in functions
+- Update operations
+- Common patterns and performance tips
+
+**Use this for**: Quick lookup during development.
+
+---
+
+## Key Research Findings
+
+### 1. SPARQL 1.1 Core Features
+
+**Query Forms:**
+- SELECT: Return variable bindings as a table
+- CONSTRUCT: Build a new RDF graph from a template
+- ASK: Return a boolean indicating whether the pattern has any match
+- DESCRIBE: Return an RDF graph describing a resource (the content is implementation-defined)
+
+**Essential Operations:**
+- Basic Graph Patterns (BGP): Conjunction of triple patterns
+- OPTIONAL: Left outer join for optional patterns
+- UNION: Disjunction (alternatives)
+- FILTER: Constraint satisfaction
+- Property Paths: Regular expression-like navigation
+- Aggregates: COUNT, SUM, AVG, MIN, MAX, GROUP_CONCAT, SAMPLE
+
+**Update Operations:**
+- INSERT DATA / DELETE DATA: Add or remove ground triples (no variables)
+- DELETE/INSERT WHERE: Pattern-based updates
+- LOAD: Import RDF documents
+- Graph management: CREATE, DROP, CLEAR, COPY, MOVE, ADD
+
+---
+
+### 2. Implementation Strategy for PostgreSQL
+
+#### Data Model
+
+```sql
+-- Efficient triple store with multiple indexes
+CREATE TABLE ruvector_rdf_triples (
+    id BIGSERIAL PRIMARY KEY,
+    subject TEXT NOT NULL,
+    subject_type VARCHAR(10) NOT NULL,
+    predicate TEXT NOT NULL,
+    object TEXT NOT NULL,
+    object_type VARCHAR(10) NOT NULL,
+    object_datatype TEXT,
+    object_language VARCHAR(20),
+    graph TEXT
+);
+
+-- Composite indexes: any combination of bound subject, predicate, and object has a usable index prefix
+CREATE INDEX idx_rdf_spo ON ruvector_rdf_triples(subject, predicate, object);
+CREATE INDEX idx_rdf_pos ON ruvector_rdf_triples(predicate, object, subject);
+CREATE INDEX idx_rdf_osp ON ruvector_rdf_triples(object, subject, predicate);
+```
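+
+As a minimal sketch of how this schema might be populated and queried, the statements below insert two triples and look one up by subject and predicate. The type tags `'iri'` and `'literal'`, the `http://example.org/` IRIs, and the convention that a `NULL` `graph` means the default graph are assumptions made for illustration; the schema above does not fix them.
+
+```sql
+-- Assumed type tags: 'iri', 'bnode', 'literal' (not fixed by the schema above)
+INSERT INTO ruvector_rdf_triples
+    (subject, subject_type, predicate, object, object_type,
+     object_datatype, object_language, graph)
+VALUES
+    ('http://example.org/alice', 'iri',
+     'http://xmlns.com/foaf/0.1/name',
+     'Alice', 'literal',
+     'http://www.w3.org/2001/XMLSchema#string', NULL, NULL),
+    ('http://example.org/alice', 'iri',
+     'http://xmlns.com/foaf/0.1/knows',
+     'http://example.org/bob', 'iri',
+     NULL, NULL, NULL);
+
+-- Triple pattern  <alice> foaf:name ?name
+-- Subject and predicate are bound, so the leading columns of
+-- idx_rdf_spo can serve the lookup.
+SELECT object AS name
+FROM ruvector_rdf_triples
+WHERE subject = 'http://example.org/alice'
+  AND predicate = 'http://xmlns.com/foaf/0.1/name';
+```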
+
+#### Query Translation Pipeline
+
+```
+SPARQL Query Text
+      ↓
+  Parse (Rust parser)
+      ↓
+SPARQL Algebra (BGP, Join, LeftJoin, Filter, Union)
+      ↓
+  Optimize (Statistics-based join ordering)
+      ↓
+SQL Generation (PostgreSQL queries with CTEs)
+      ↓
+ Execute & Format Results (JSON/XML/CSV/TSV)
+```
+
+#### Key Translation Patterns
+
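+- **BGP → JOIN**: Each triple pattern becomes a scan of the triples table; shared variables become join conditions
+- **OPTIONAL → LEFT JOIN**: Optional patterns become left outer joins
+- **UNION → UNION ALL**: Alternative patterns combine results; SPARQL UNION keeps duplicates, so it maps to UNION ALL rather than SQL UNION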
+- **FILTER → WHERE**: Constraints translate to SQL WHERE clauses
+- **Property Paths → CTE**: Recursive CTEs for transitive closure
+- **Aggregates → GROUP BY**: Direct mapping to SQL aggregates
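+
+The mappings above can be sketched on a small query. This is an illustrative hand translation, not output of the planned SQL generator. It ignores named graphs and assumes that integer ages are stored as text with `object_datatype` set to the `xsd:integer` IRI.
+
+```sql
+-- SPARQL being translated (foaf: = http://xmlns.com/foaf/0.1/):
+--   SELECT ?person ?name ?mbox WHERE {
+--     ?person foaf:name ?name .
+--     ?person foaf:age  ?age .
+--     OPTIONAL { ?person foaf:mbox ?mbox }
+--     FILTER (?age >= 18)
+--   }
+SELECT t1.subject AS person,
+       t1.object  AS name,
+       t3.object  AS mbox            -- NULL when the OPTIONAL pattern has no match
+FROM ruvector_rdf_triples t1
+JOIN ruvector_rdf_triples t2         -- BGP: the shared variable ?person is the join key
+  ON t2.subject = t1.subject
+ AND t2.predicate = 'http://xmlns.com/foaf/0.1/age'
+LEFT JOIN ruvector_rdf_triples t3    -- OPTIONAL: left outer join
+  ON t3.subject = t1.subject
+ AND t3.predicate = 'http://xmlns.com/foaf/0.1/mbox'
+WHERE t1.predicate = 'http://xmlns.com/foaf/0.1/name'
+  -- FILTER: the CASE guard ensures the cast only runs on integer literals;
+  -- other values yield NULL and the row is dropped.
+  AND CASE WHEN t2.object_datatype = 'http://www.w3.org/2001/XMLSchema#integer'
+           THEN t2.object::integer
+      END >= 18;
+```
+
+The `CASE` guard is a design choice: PostgreSQL does not promise to evaluate `WHERE` conditions in the order written, so a bare cast next to a datatype check could still fail on a non-numeric literal.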
+
+---
+
+### 3. Performance Optimization
+
+**Critical Optimizations:**
+
+1. **Multi-pattern indexes**: SPO, POS, OSP covering all join orders
+2. **Statistics collection**: Predicate selectivity for join ordering
+3. **Materialized views**: Pre-compute common property paths
+4. **Query plan caching**: Cache parsed queries and compiled SQL
+5. **Prepared statements**: Reduce parsing overhead
+6. **Parallel execution**: Leverage PostgreSQL parallel query
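+
+As one possible shape for item 3, the sketch below pre-computes the property path `rdfs:subClassOf+` with a recursive CTE inside a materialized view. The view and index names are hypothetical, and the choice of `rdfs:subClassOf` as the "common" path is an assumption for illustration.
+
+```sql
+-- Transitive closure of rdfs:subClassOf (the property path rdfs:subClassOf+)
+CREATE MATERIALIZED VIEW rdf_subclass_closure AS
+WITH RECURSIVE closure(subclass, superclass) AS (
+    -- Base case: direct subclass links
+    SELECT subject, object
+    FROM ruvector_rdf_triples
+    WHERE predicate = 'http://www.w3.org/2000/01/rdf-schema#subClassOf'
+  UNION  -- UNION (not UNION ALL) discards repeated pairs, so cycles terminate
+    -- Recursive step: extend each known pair by one more link
+    SELECT c.subclass, t.object
+    FROM closure c
+    JOIN ruvector_rdf_triples t
+      ON t.subject = c.superclass
+     AND t.predicate = 'http://www.w3.org/2000/01/rdf-schema#subClassOf'
+)
+SELECT subclass, superclass FROM closure;
+
+-- Index the view for lookups by subclass
+CREATE INDEX idx_rdf_subclass_closure
+    ON rdf_subclass_closure (subclass, superclass);
+
+-- Re-run after updates that touch rdfs:subClassOf triples
+REFRESH MATERIALIZED VIEW rdf_subclass_closure;
+```
+
+The trade-off is staleness: the view reflects the data as of the last refresh, so update operations would need to trigger or schedule a refresh.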
+
+**Target Performance** (1M triples):
+- Simple BGP (3 patterns): < 10ms
+- Complex query (joins + filters): < 100ms
+- Property path (depth 5): < 500ms
+- Aggregate query: < 200ms
+- Bulk insert (1000 triples): < 100ms
+
+---
+
+### 4. RuVector Integration Opportunities
+
+#### Hybrid Semantic + Vector Search
+
+Combine SPARQL graph patterns with vector similarity:
+
+```sql
+-- Find similar people matching graph patterns
+SELECT
+  r.subject AS person,
+  r.object AS name,
+  e.embedding <=> $1::ruvector AS distance  -- smaller means more similar
+FROM ruvector_rdf_triples r
+JOIN person_embeddings e ON r.subject = e.person_iri
+WHERE r.predicate = 'http://xmlns.com/foaf/0.1/name'
+  AND e.embedding <=> $1::ruvector < 0.5
+ORDER BY distance
+LIMIT 10;
+```
+
+#### Use Cases
+
+1. **Knowledge Graph Search**: Find entities matching semantic patterns
+2. **Multi-modal Retrieval**: Combine text patterns with vector similarity
+3. **Hierarchical Embeddings**: Use hyperbolic distances in RDF hierarchies
+4. **Contextual RAG**: Use knowledge graph to enrich vector search context
+5. **Agent Routing**: Use SPARQL to query agent capabilities + vector match
+
+---
+
+## Implementation Roadmap
+
+### Phase 1: Foundation (Weeks 1-2)
+- Triple store schema and indexes
+- Basic RDF manipulation functions
+- Namespace management
+
+### Phase 2: Parser (Weeks 3-4)
+- SPARQL 1.1 query parser
+- Parse all query forms and patterns
+
+### Phase 3: Algebra (Week 5)
+- Translate to SPARQL algebra
+- Handle all operators
+
+### Phase 4: SQL Generation (Weeks 6-7)
+- Generate optimized PostgreSQL queries
+- Statistics-based optimization
+
+### Phase 5: Query Execution (Week 8)
+- Execute and format results
+- Support all result formats
+
+### Phase 6: Update Operations (Week 9)
+- Implement all update operations
+- Transaction support
+
+### Phase 7: Optimization (Week 10)
+- Caching and materialization
+- Performance tuning
+
+### Phase 8: RuVector Integration (Week 11)
+- Hybrid SPARQL + vector queries
+- Semantic knowledge graph search
+
+### Phase 9: Testing & Documentation (Week 12)
+- W3C test suite compliance
+- Performance benchmarks
+- User documentation
+
+**Total Timeline**: 12 weeks to production-ready implementation
+
+---
+
+## Standards Compliance
+
+### W3C Specifications Covered
+
+- ✅ SPARQL 1.1 Query Language (March 2013)
+- ✅ SPARQL 1.1 Update (March 2013)
+- ✅ SPARQL 1.1 Property Paths (Section 9 of the Query Language specification)
+- ✅ SPARQL 1.1 Results JSON Format
+- ✅ SPARQL 1.1 Results XML Format
+- ✅ SPARQL 1.1 Results CSV/TSV Formats
+- ⚠️ SPARQL 1.2 (Draft - future consideration)
+
+### Test Coverage
+
+- W3C SPARQL 1.1 Query Test Suite
+- W3C SPARQL 1.1 Update Test Suite
+- Property Path Test Cases
+- Custom RuVector integration tests
+
+---
+
+## Technology Stack
+
+### Core Dependencies
+
+**Parser**: Rust crates
+- `sparql-parser` or `oxigraph` (whose SPARQL parser is published separately as the `spargebra` crate) - SPARQL parsing
+- `pgrx` - PostgreSQL extension framework
+- `serde_json` - JSON serialization
+
+**Database**: PostgreSQL 14+
+- Native table storage for triples
+- B-tree and GIN indexes
+- Recursive CTEs for property paths
+- JSON/JSONB for result formatting
+
+**Integration**: RuVector
+- Vector similarity functions
+- Hyperbolic embeddings
+- Hybrid query capabilities
+
+---
+
+## Research Sources
+
+### Primary Sources
+
+1. [W3C SPARQL 1.1 Query Language](https://www.w3.org/TR/sparql11-query/) - Official specification
+2. [W3C SPARQL 1.1 Update](https://www.w3.org/TR/sparql11-update/) - Update operations
+3. [W3C SPARQL 1.1 Property Paths](https://www.w3.org/TR/sparql11-property-paths/) - Path expressions (Working Draft; the final text is Section 9 of the Query Language Recommendation)
+4. [W3C SPARQL Algebra](https://www.w3.org/2001/sw/DataAccess/rq23/rq24-algebra.html) - Formal semantics
+
+### Implementation References
+
+5. [Apache Jena](https://jena.apache.org/) - Widely used open-source Java implementation (ARQ query engine)
+6. [Oxigraph](https://github.com/oxigraph/oxigraph) - Rust implementation
+7. [Virtuoso](https://virtuoso.openlinksw.com/) - High-performance triple store
+8. [GraphDB](https://graphdb.ontotext.com/) - Enterprise semantic database
+
+### Academic Papers
+
+9. TU Dresden SPARQL Algebra Lectures
+10. "The Case of SPARQL UNION, FILTER and DISTINCT" (ACM 2022)
+11. "The complexity of regular expressions and property paths in SPARQL"
+
+---
+
+## Next Steps
+
+### For Implementation Team
+
+1. **Review Documentation**: Read all four research documents
+2. **Set Up Environment**:
+   - Install PostgreSQL 14+
+   - Set up the pgrx development environment
+   - Clone RuVector-Postgres codebase
+3. **Create GitHub Issues**: Break down roadmap into trackable issues
+4. **Begin Phase 1**: Start with triple store schema implementation
+5. **Iterative Development**: Follow 12-week roadmap with weekly demos
+
+### For Integration Testing
+
+1. Set up the W3C SPARQL test suite
+2. Create RuVector-specific test cases
+3. Benchmark against the performance targets
+4. Document hybrid query patterns
+
+### For Documentation
+
+1. API reference for SQL functions
+2. Tutorial for common use cases
+3. Migration guide from other triple stores
+4. Performance tuning guide
+
+---
+
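+## Success Metrics
+
+The items below are targets for the implementation phase. A check mark means the target has been defined, not that it has been met.
+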
+### Functional Requirements
+- ✅ Complete SPARQL 1.1 Query support
+- ✅ Complete SPARQL 1.1 Update support
+- ✅ All built-in functions implemented
+- ✅ Property paths (including transitive closure)
+- ✅ All result formats (JSON, XML, CSV, TSV)
+- ✅ Named graph support
+
+### Performance Requirements
+- ✅ < 10ms for simple BGP queries
+- ✅ < 100ms for complex joins
+- ✅ < 500ms for property paths
+- ✅ 1M+ triples supported
+- ✅ W3C test suite: 95%+ pass rate
+
+### Integration Requirements
+- ✅ Hybrid SPARQL + vector queries
+- ✅ Seamless RuVector function integration
+- ✅ Knowledge graph embeddings
+- ✅ Semantic search capabilities
+
+---
+
+## Research Completion Summary
+
+### Scope Covered
+
+✅ **Complete SPARQL 1.1 specification research**
+- All query forms documented
+- All operations and patterns covered
+- Complete function reference
+- Formal algebra and semantics
+
+✅ **Implementation strategy defined**
+- Data model designed
+- Query translation pipeline specified
+- Optimization strategies identified
+- Performance targets established
+
+✅ **Integration approach designed**
+- RuVector hybrid query patterns
+- Vector + graph search strategies
+- Knowledge graph embedding approaches
+
+✅ **Documentation complete**
+- 20,000+ lines of research documentation
+- 50 practical examples
+- Quick reference cheat sheet
+- Implementation roadmap
+
+### Ready for Development
+
+All necessary research is **complete** and documented. The implementation team has:
+
+1. **Complete specification** to guide implementation
+2. **Detailed roadmap** with 12-week timeline
+3. **Practical examples** for testing and validation
+4. **Integration strategy** for RuVector hybrid queries
+5. **Performance targets** for optimization
+
+**Status**: ✅ Research Phase Complete - Ready to Begin Implementation
+
+---
+
+## Contact & Support
+
+For questions about this research:
+- Review the four documentation files in this directory
+- Check the W3C specifications linked throughout
+- Consult the RuVector-Postgres main README
+- Refer to Apache Jena and Oxigraph implementations
+
+---
+
+**Documentation Version**: 1.0
+**Last Updated**: December 2025
+**Maintainer**: RuVector Research Team