# RuVector Dataset Discovery Framework

**Find hidden patterns and connections in massive datasets that traditional tools miss.**

RuVector turns your data (research papers, climate records, financial filings) into a connected graph, then uses approximate nearest-neighbor search and minimum-cut analysis to spot emerging trends, cross-domain relationships, and regime shifts *before* they become obvious.

## Why RuVector?

Most data analysis tools excel at answering questions you already know to ask. RuVector is different: it helps you **discover what you don't know you're looking for**.

**Real-world examples:**

- 🔬 **Research**: Spot a new field forming 6-12 months before it gets a name, by detecting when papers start citing across traditional boundaries
- 🌍 **Climate**: Detect regime shifts in weather patterns that correlate with economic disruptions
- 💰 **Finance**: Find companies whose narratives are diverging from their peers, often an early warning signal

## Features

| Feature | What It Does | Why It Matters |
|---------|--------------|----------------|
| **Vector Memory** | Stores data as 384-1536 dim embeddings | Similar concepts cluster together automatically |
| **HNSW Index** | O(log n) approximate nearest neighbor search | 10-50x faster than brute force for large datasets |
| **Graph Structure** | Connects related items with weighted edges | Reveals hidden relationships in your data |
| **Min-Cut Analysis** | Measures how "connected" your network is | Detects regime changes and fragmentation |
| **Cross-Domain Detection** | Finds bridges between different fields | Discovers unexpected correlations (e.g., climate → finance) |
| **ONNX Embeddings** | Neural semantic embeddings (MiniLM, BGE, etc.) | Production-quality text understanding |
| **Causality Testing** | Checks if changes in X predict changes in Y | Moves beyond correlation to actionable insights |
| **Statistical Rigor** | Reports p-values and effect sizes | Know which findings are real vs. noise |

### What's New in v0.3.0

- **HNSW Integration**: All-pairs similarity search in O(n log n) (n queries at O(log n) each) replaces O(n²) brute force
- **Similarity Cache**: 2-3x speedup for repeated similarity queries
- **Batch ONNX Embeddings**: Chunked processing with progress callbacks
- **Shared Utils Module**: `cosine_similarity`, `euclidean_distance`, `normalize_vector`
- **Auto-connect by Embeddings**: CoherenceEngine creates edges from vector similarity

### Performance

- ⚡ **10-50x faster** similarity search (HNSW vs brute force)
- ⚡ **8.8x faster** batch vector insertion (parallel processing)
- ⚡ **2.9x faster** similarity computation (SIMD acceleration)
- ⚡ **2-3x faster** repeated queries (similarity cache)
- 📊 Works with **millions of records** on standard hardware

## Quick Start

### Prerequisites

```bash
# Ensure you're in the ruvector workspace
cd /workspaces/ruvector
```

### Run Your First Example

```bash
# 1. Performance benchmark - see the speed improvements
cargo run --example optimized_benchmark -p ruvector-data-framework --features parallel --release

# 2. Discovery hunter - find patterns in sample data
cargo run --example discovery_hunter -p ruvector-data-framework --features parallel --release

# 3. Cross-domain analysis - detect bridges between fields
cargo run --example cross_domain_discovery -p ruvector-data-framework --release
```

### Domain-Specific Examples

```bash
# Climate: Detect weather regime shifts
cargo run --example regime_detector -p ruvector-data-climate

# Finance: Monitor corporate filing coherence
cargo run --example coherence_watch -p ruvector-data-edgar
```

### What You'll See

```
🔍 Discovery Results:
   Pattern: Climate ↔ Finance bridge detected
   Strength: 0.73 (strong connection)
   P-value: 0.031 (statistically significant)
   → Drought indices may predict utility sector performance with a 3-period lag
```

## The Discovery Thesis

RuVector's unique combination of **vector memory**, **graph structures**, and **dynamic minimum cut algorithms** enables discoveries that most analysis tools miss:

- **Emerging patterns before they have names**: Detect topic splits and merges as cut boundaries shift over time
- **Non-obvious cross-domain bridges**: Find small "connector" subgraphs where disciplines quietly start citing each other
- **Causal leverage maps**: Link funders, labs, venues, and downstream citations to spot high-impact intervention points
- **Regime shifts in time series**: Use coherence breaks to flag fundamental changes in system behavior

## Tutorial

### 1. Creating the Engine

```rust
use ruvector_data_framework::optimized::{
    OptimizedDiscoveryEngine, OptimizedConfig,
};
use ruvector_data_framework::ruvector_native::{
    Domain, SemanticVector,
};

// Override selected defaults; see the Configuration Reference for all fields.
let config = OptimizedConfig {
    similarity_threshold: 0.55,   // Minimum cosine similarity
    mincut_sensitivity: 0.10,     // Coherence change threshold
    cross_domain: true,           // Enable cross-domain discovery
    use_simd: true,               // SIMD acceleration
    significance_threshold: 0.05, // P-value threshold
    causality_lookback: 12,       // Temporal lookback periods
    ..Default::default()
};

let mut engine = OptimizedDiscoveryEngine::new(config);
```

### 2. Adding Data

```rust
use std::collections::HashMap;
use chrono::Utc;

// Single vector
// `generate_embedding()` is a placeholder for your own embedding source.
let vector = SemanticVector {
    id: "climate_drought_2024".to_string(),
    embedding: generate_embedding(), // 128-dim vector
    domain: Domain::Climate,
    timestamp: Utc::now(),
    metadata: HashMap::from([
        ("region".to_string(), "sahel".to_string()),
        ("severity".to_string(), "extreme".to_string()),
    ]),
};
let node_id = engine.add_vector(vector);

// Batch insertion (8.8x faster); requires the `parallel` feature.
// `load_vectors()` is a placeholder for your own loader.
#[cfg(feature = "parallel")]
{
    let vectors: Vec<SemanticVector> = load_vectors();
    let node_ids = engine.add_vectors_batch(vectors);
}
```

### 3. Computing Coherence

```rust
let snapshot = engine.compute_coherence();

println!("Min-cut value: {:.3}", snapshot.mincut_value);
println!("Partition sizes: {:?}", snapshot.partition_sizes);
println!("Boundary nodes: {:?}", snapshot.boundary_nodes);
```

**Interpretation:**

| Min-cut Trend | Meaning |
|---------------|---------|
| Rising | Network consolidating, stronger connections |
| Falling | Fragmentation, potential regime change |
| Stable | Steady state, consistent structure |

### 4. Pattern Detection

```rust
let patterns = engine.detect_patterns_with_significance();

// Report only the patterns that pass the configured significance threshold.
for pattern in patterns.iter().filter(|p| p.is_significant) {
    println!("{}", pattern.pattern.description);
    println!("  P-value: {:.4}", pattern.p_value);
    println!("  Effect size: {:.3}", pattern.effect_size);
}
```

**Pattern Types:**

| Type | Description | Example |
|------|-------------|---------|
| `CoherenceBreak` | Min-cut dropped significantly | Network fragmentation crisis |
| `Consolidation` | Min-cut increased | Market convergence |
| `BridgeFormation` | Cross-domain connections | Climate-finance link |
| `Cascade` | Temporal causality | Climate → Finance lag-3 |
| `EmergingCluster` | New dense subgraph | Research topic emerging |

### 5. Cross-Domain Analysis

```rust
// Check coupling strength: the share of edges that cross domain boundaries
let stats = engine.stats();
let coupling = stats.cross_domain_edges as f64 / stats.total_edges as f64;
println!("Cross-domain coupling: {:.1}%", coupling * 100.0);

// Domain coherence scores
for domain in [Domain::Climate, Domain::Finance, Domain::Research] {
    if let Some(coh) = engine.domain_coherence(domain) {
        println!("{:?}: {:.3}", domain, coh);
    }
}
```

## Performance Benchmarks

| Operation | Baseline | Optimized | Speedup |
|-----------|----------|-----------|---------|
| Vector Insertion | 133ms | 15ms | **8.84x** |
| SIMD Cosine | 432ms | 148ms | **2.91x** |
| Pattern Detection | 524ms | 655ms | - |

## Datasets

### 1. OpenAlex (Research Intelligence)

**Best for**: Emerging field detection, cross-discipline bridges

- 250M+ works, 90M+ authors
- Native graph structure
- Bulk download + API access

```rust
use ruvector_data_openalex::{OpenAlexConfig, FrontierRadar};

// `papers` is the collection of works you have already loaded.
let radar = FrontierRadar::new(OpenAlexConfig::default());
let frontiers = radar.detect_emerging_topics(papers);
```

### 2. NOAA + NASA (Climate Intelligence)

**Best for**: Regime shift detection, anomaly prediction

- Weather observations, satellite imagery
- Time series → graph transformation
- Economic risk modeling

```rust
use ruvector_data_climate::{ClimateConfig, RegimeDetector};

// `config` is a `ClimateConfig` you have constructed beforehand.
let detector = RegimeDetector::new(config);
let shifts = detector.detect_shifts();
```

### 3. SEC EDGAR (Financial Intelligence)

**Best for**: Corporate risk signals, peer divergence

- XBRL financial statements
- 10-K/10-Q filings
- Narrative + fundamental analysis

```rust
use ruvector_data_edgar::{EdgarConfig, CoherenceMonitor};

// `config` is an `EdgarConfig` and `filing` a filing you have loaded beforehand.
let monitor = CoherenceMonitor::new(config);
let alerts = monitor.analyze_filing(filing);
```

## Directory Structure

```
examples/data/
├── README.md                          # This file
├── Cargo.toml                         # Workspace manifest
├── framework/                         # Core discovery framework
│   ├── src/
│   │   ├── lib.rs                     # Framework exports
│   │   ├── ruvector_native.rs         # Native engine with Stoer-Wagner
│   │   ├── optimized.rs               # SIMD + parallel optimizations
│   │   ├── coherence.rs               # Coherence signal computation
│   │   ├── discovery.rs               # Pattern detection
│   │   └── ingester.rs                # Data ingestion
│   └── examples/
│       ├── cross_domain_discovery.rs  # Cross-domain patterns
│       ├── optimized_benchmark.rs     # Performance comparison
│       └── discovery_hunter.rs        # Novel pattern search
├── openalex/                          # OpenAlex integration
├── climate/                           # NOAA/NASA integration
└── edgar/                             # SEC EDGAR integration
```

## Configuration Reference

### OptimizedConfig

| Parameter | Default | Description |
|-----------|---------|-------------|
| `similarity_threshold` | 0.65 | Minimum cosine similarity for edges |
| `mincut_sensitivity` | 0.12 | Sensitivity to coherence changes |
| `cross_domain` | true | Enable cross-domain discovery |
| `batch_size` | 256 | Parallel batch size |
| `use_simd` | true | Enable SIMD acceleration |
| `similarity_cache_size` | 10000 | Max cached similarity pairs |
| `significance_threshold` | 0.05 | P-value threshold |
| `causality_lookback` | 10 | Temporal lookback periods |
| `causality_min_correlation` | 0.6 | Minimum correlation for causality |

### CoherenceConfig (v0.3.0)

| Parameter | Default | Description |
|-----------|---------|-------------|
| `similarity_threshold` | 0.5 | Min similarity for auto-connecting embeddings |
| `use_embeddings` | true | Auto-create edges from embedding similarity |
| `hnsw_k_neighbors` | 50 | Neighbors to search per vector (HNSW) |
| `hnsw_min_records` | 100 | Min records to trigger HNSW (else brute force) |
| `min_edge_weight` | 0.01 | Minimum edge weight threshold |
| `approximate` | true | Use approximate min-cut for speed |
| `parallel` | true | Enable parallel computation |

## Discovery Examples

### Climate-Finance Bridge

```
Detected: Climate ↔ Finance bridge
  Strength: 0.73
  Connections: 197
  Hypothesis: Drought indices may predict utility sector performance with lag-2
```

### Regime Shift Detection

```
Min-cut trajectory:
  t=0: 72.5 (baseline)
  t=1: 73.3 (+1.1%)
  t=2: 74.5 (+1.6%) ← Consolidation
  Effect size: 2.99 (large)
  P-value: 0.042 (significant)
```

### Causality Pattern

```
Climate → Finance causality detected
  F-statistic: 4.23
  Optimal lag: 3 periods
  Correlation: 0.67
  P-value: 0.031
```

## Algorithms

### HNSW (Hierarchical Navigable Small World)

Approximate nearest neighbor search in high-dimensional spaces.

- **Complexity**: O(log n) search, O(log n) insert
- **Use**: Fast similarity search for edge creation
- **Parameters**: `m=16`, `ef_construction=200`, `ef_search=50`

### Stoer-Wagner Min-Cut

Computes the minimum cut of a weighted undirected graph.

- **Complexity**: O(VE + V² log V)
- **Use**: Network coherence measurement

### SIMD Cosine Similarity

Processes 8 floats per iteration using AVX2.

- **Speedup**: 2.9x vs scalar
- **Fallback**: Chunked scalar (8 floats per iteration)

### Granger Causality

Tests whether past values of X help predict Y.

1. Compute cross-correlation at lags 1..k
2. Find the optimal lag with max |correlation|
3. Calculate the F-statistic
4. Convert to a p-value

## Best Practices

1. **Start with low thresholds** - Use `similarity_threshold: 0.45` for exploration
2. **Use batch insertion** - `add_vectors_batch()` is about 8.8x faster than inserting one vector at a time
3. **Monitor coherence trends** - A falling min-cut trajectory can signal a regime change
4. **Filter by significance** - Focus on `p_value < 0.05`
5. **Validate causality** - Temporal patterns need domain expertise before they are treated as causal

## Troubleshooting

| Problem | Solution |
|---------|----------|
| No patterns detected | Lower `mincut_sensitivity` to 0.05 |
| Too many edges | Raise `similarity_threshold` to 0.70 |
| Slow performance | Use `--features parallel --release` |
| Memory issues | Reduce `batch_size` |

## References

- [OpenAlex Documentation](https://docs.openalex.org/)
- [NOAA Open Data](https://www.noaa.gov/information-technology/open-data-dissemination)
- [NASA Earthdata](https://earthdata.nasa.gov/)
- [SEC EDGAR](https://www.sec.gov/edgar)

## License

MIT OR Apache-2.0
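
## Appendix: Chunked Cosine Similarity Sketch

The Algorithms section describes a scalar fallback for cosine similarity that processes 8 floats per iteration. The following is a minimal sketch of that idea, not the framework's actual implementation: the function name `cosine_similarity_chunked` and its handling of edge cases (returning 0.0 for mismatched lengths, empty input, or zero-norm vectors) are assumptions made for illustration.

```rust
/// Cosine similarity with 8-wide chunked accumulation (illustrative sketch).
/// Assumption: returns 0.0 for mismatched lengths, empty input, or zero norms.
fn cosine_similarity_chunked(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }

    // Eight independent accumulators per quantity, one per lane of the chunk.
    let mut dot = [0.0f32; 8];
    let mut norm_a = [0.0f32; 8];
    let mut norm_b = [0.0f32; 8];

    let chunks_a = a.chunks_exact(8);
    let chunks_b = b.chunks_exact(8);
    // Elements left over when the length is not a multiple of 8.
    let rem_a = chunks_a.remainder();
    let rem_b = chunks_b.remainder();

    for (ca, cb) in chunks_a.zip(chunks_b) {
        for i in 0..8 {
            dot[i] += ca[i] * cb[i];
            norm_a[i] += ca[i] * ca[i];
            norm_b[i] += cb[i] * cb[i];
        }
    }

    // Reduce the lanes, then fold in the remainder one element at a time.
    let mut dot_sum: f32 = dot.iter().sum();
    let mut a_sum: f32 = norm_a.iter().sum();
    let mut b_sum: f32 = norm_b.iter().sum();
    for (x, y) in rem_a.iter().zip(rem_b) {
        dot_sum += x * y;
        a_sum += x * x;
        b_sum += y * y;
    }

    let denom = a_sum.sqrt() * b_sum.sqrt();
    if denom == 0.0 {
        0.0
    } else {
        dot_sum / denom
    }
}

fn main() {
    let a: Vec<f32> = (1..=10).map(|v| v as f32).collect();
    let b: Vec<f32> = (1..=10).rev().map(|v| v as f32).collect();
    println!("similarity = {:.4}", cosine_similarity_chunked(&a, &b));
}
```

Keeping one accumulator per lane avoids a serial dependency between additions, which gives the compiler room to vectorize the inner loop even without explicit AVX2 intrinsics.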
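
## Appendix: Lagged Cross-Correlation Sketch

The first two steps listed under Granger Causality (cross-correlation at lags 1..k, then picking the lag with the largest absolute correlation) can be sketched as below. This is an illustration under stated assumptions rather than the framework's code: `pearson`, `best_lag`, and `demo_series` are hypothetical names, the series are assumed to be evenly spaced, and the F-statistic and p-value steps are deliberately left out.

```rust
/// Pearson correlation of two equal-length slices; `None` if undefined.
fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    let n = x.len();
    if n < 2 || n != y.len() {
        return None;
    }
    let mean_x = x.iter().sum::<f64>() / n as f64;
    let mean_y = y.iter().sum::<f64>() / n as f64;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        let dx = a - mean_x;
        let dy = b - mean_y;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None; // a constant series has no defined correlation
    }
    Some(sxy / (sxx.sqrt() * syy.sqrt()))
}

/// Pairs x[t] with y[t + lag] for lag in 1..=max_lag and returns the
/// (lag, correlation) with the largest absolute correlation.
fn best_lag(x: &[f64], y: &[f64], max_lag: usize) -> Option<(usize, f64)> {
    let n = x.len().min(y.len());
    let mut best: Option<(usize, f64)> = None;
    for lag in 1..=max_lag {
        if lag + 2 > n {
            break; // fewer than two overlapping points remain
        }
        if let Some(r) = pearson(&x[..n - lag], &y[lag..n]) {
            if best.map_or(true, |(_, b)| r.abs() > b.abs()) {
                best = Some((lag, r));
            }
        }
    }
    best
}

/// Deterministic pseudo-random series in [0, 1) for the demo (hypothetical helper).
fn demo_series(n: usize) -> Vec<f64> {
    let mut state: u64 = 42;
    (0..n)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            (state >> 33) as f64 / (1u64 << 31) as f64
        })
        .collect()
}

fn main() {
    // y repeats x three periods later, so the strongest correlation is at lag 3.
    let x = demo_series(40);
    let mut y = vec![0.0; 40];
    for t in 3..40 {
        y[t] = x[t - 3];
    }
    if let Some((lag, r)) = best_lag(&x, &y, 6) {
        println!("best lag = {lag}, correlation = {r:.3}");
    }
}
```

A high correlation at some lag is only a screening signal. A full Granger test compares regression models with and without the lagged predictor, which is why the remaining steps compute an F-statistic and a p-value.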