Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
ruv
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions

@@ -0,0 +1,368 @@
# Data Source Clients - Quick Reference
## Summary Statistics
**Total Clients**: 32 listed in the tables below, across 12 modules
**Total Public Methods**: 150+
**Domain Coverage**: 10 (News, Social, Research, Economic, Patent, Space, Genomics, Physics, Medical, Knowledge)
**Embedding Dimensions**: 256 (standard), 384 (medical/scientific)
---
## Client Index by Domain
### News & Social (4 clients, 17 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|--------|----------|------|------------|---------|
| News API | newsapi.org | Required | 100ms | 4 |
| Reddit | reddit.com | Required | 1000ms | 5 |
| GitHub | github.com | Optional | 1000ms | 4 |
| HackerNews | hacker-news.firebase | None | 100ms | 4 |
### Economic & Financial (4 clients, 12 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|--------|----------|------|------------|---------|
| World Bank | worldbank.org | None | 250ms | 3 |
| FRED | stlouisfed.org | Required | 200ms | 3 |
| Alpha Vantage | alphavantage.co | Required | 12000ms | 4 |
| IMF | imf.org | None | 500ms | 2 |
### Patents (3 clients, 8 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|--------|----------|------|------------|---------|
| USPTO | uspto.gov | None | 500ms | 3 |
| EPO | ops.epo.org | Required | 1000ms | 3 |
| Google Patents | patents.google.com | None | 1000ms | 2 |
### Research Papers (5 clients, 23 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|--------|----------|------|------------|---------|
| ArXiv | arxiv.org | None | 3000ms | 4 |
| Semantic Scholar | semanticscholar.org | Optional | 1000ms/100ms | 6 |
| bioRxiv | biorxiv.org | None | 500ms | 4 |
| medRxiv | medrxiv.org | None | 500ms | 4 |
| CrossRef | crossref.org | None | 200ms | 5 |
### Space & Astronomy (3 clients, 10 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|--------|----------|------|------------|---------|
| NASA APOD | api.nasa.gov | Optional | 1000ms | 3 |
| SpaceX | spacexdata.com | None | 500ms | 4 |
| SIMBAD | simbad.cds.unistra.fr | None | 1000ms | 3 |
### Genomics & Proteomics (4 clients, 16 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|--------|----------|------|------------|---------|
| NCBI Gene | ncbi.nlm.nih.gov | Optional | 334ms/100ms | 4 |
| Ensembl | ensembl.org | None | 200ms | 5 |
| UniProt | uniprot.org | None | 200ms | 4 |
| PDB | rcsb.org | None | 500ms | 3 |
### Physics & Earth Science (4 clients, 15 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|--------|----------|------|------------|---------|
| USGS Earthquake | earthquake.usgs.gov | None | 200ms | 5 |
| CERN Open Data | opendata.cern.ch | None | 500ms | 3 |
| Argo Ocean | data-argo.ifremer.fr | None | 300ms | 4 |
| Materials Project | materialsproject.org | Required | 1000ms | 3 |
### Knowledge Graphs (2 clients, 11 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|--------|----------|------|------------|---------|
| Wikipedia | wikipedia.org | None | 100ms | 4 |
| Wikidata | wikidata.org | None | 100ms | 7 |
### Medical & Health (3 clients, 9 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|--------|----------|------|------------|---------|
| PubMed | ncbi.nlm.nih.gov | Optional | 334ms/100ms | 4 |
| ClinicalTrials | clinicaltrials.gov | None | 100ms | 2 |
| FDA OpenFDA | fda.gov | None | 250ms | 3 |
---
## Rate Limiting Quick Reference
### Strictest Limits (Use Sparingly)
- **Alpha Vantage**: 12000ms (5 req/min, 500/day)
- **ArXiv**: 3000ms (1 req/3sec per guidelines)
### Standard Limits (Typical Usage)
- **1000ms**: Reddit, GitHub, EPO, Google Patents, SIMBAD, NASA, Materials Project
- **500ms**: USPTO, bioRxiv, medRxiv, IMF, SpaceX, PDB, CERN
- **250-334ms**: World Bank, FDA OpenFDA, Argo Ocean, plus NCBI Gene and PubMed without an API key
### Fast Limits (High-Volume OK)
- **100-200ms**: News API, HackerNews, FRED, CrossRef, Ensembl, UniProt, USGS Earthquake, Wikipedia, Wikidata, ClinicalTrials
- **With API Key**: NCBI Gene, PubMed, Semantic Scholar drop to 100ms
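The intervals above are minimum delays between consecutive requests. As a rough sketch of how such an interval translates into throughput and waiting time (the function names here are illustrative and are not taken from the framework):

```rust
use std::time::Duration;

/// Upper bound on requests per minute for a fixed minimum interval.
/// `min_interval_ms` must be non-zero.
fn max_requests_per_minute(min_interval_ms: u64) -> u64 {
    60_000 / min_interval_ms
}

/// Time a caller still has to wait before the next request is allowed,
/// given how long ago the previous request was sent.
fn remaining_wait(since_last_request: Duration, min_interval: Duration) -> Duration {
    // saturating_sub returns zero once the interval has already elapsed.
    min_interval.saturating_sub(since_last_request)
}
```

With these helpers, the 12000ms Alpha Vantage interval works out to 5 requests per minute, matching the figure quoted in the list above.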
---
## Authentication Quick Reference
### No Auth Required (21 clients)
HackerNews, World Bank, IMF, USPTO, Google Patents, ArXiv, bioRxiv, medRxiv, CrossRef, SpaceX, SIMBAD, Ensembl, UniProt, PDB, USGS, CERN, Argo, Wikipedia, Wikidata, ClinicalTrials, FDA
### Optional Auth (Higher Limits) (5 clients)
GitHub, Semantic Scholar, NASA APOD, NCBI Gene, PubMed
### Required Auth (6 clients)
News API, Reddit, FRED, Alpha Vantage, EPO, Materials Project
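For the optional-auth clients, the key decides which rate limit applies. The sketch below shows one way to read a key from the environment and pick the NCBI interval from the tables above. The helper names are assumptions for illustration, not the framework's API, and a variable name such as `NCBI_API_KEY` would be equally hypothetical.

```rust
use std::env;

/// Reads an optional API key from the environment.
/// An unset or blank variable is treated as "no key".
fn optional_api_key(var_name: &str) -> Option<String> {
    env::var(var_name).ok().filter(|key| !key.trim().is_empty())
}

/// Minimum delay between NCBI requests, using the two values
/// quoted in this document (334ms without a key, 100ms with one).
fn ncbi_min_interval_ms(api_key: Option<&str>) -> u64 {
    match api_key {
        Some(_) => 100,
        None => 334,
    }
}
```

A caller could combine the two as `ncbi_min_interval_ms(optional_api_key("NCBI_API_KEY").as_deref())`.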
---
## Method Count by Category
### Search Methods
- **Text Search**: All 32 clients support text-based search
- **ID Lookup**: 22 clients support direct ID/identifier lookup
- **Advanced Filters**: 18 clients support filtered searches (date, category, status, etc.)
- **Batch Operations**: 4 clients (PubMed, NCBI Gene, ArXiv, Semantic Scholar)
### Specialized Methods
- **Time-Series**: World Bank, FRED, Alpha Vantage (economic data)
- **Geographic**: USGS (earthquakes), Argo (ocean), SIMBAD (sky coordinates)
- **Graph Traversal**: Semantic Scholar (citations/references), Wikipedia (categories/links), Wikidata (SPARQL)
- **Relationships**: Wikipedia (15 avg links/article), Wikidata (structured claims)
---
## Data Transformation Patterns
### SemanticVector Output
```rust
SemanticVector {
    id: "SOURCE:identifier",           // Unique ID with source prefix
    embedding: Vec<f32>,               // 256 or 384 dimensions
    domain: Domain::*,                 // News, Research, Medical, etc.
    timestamp: DateTime<Utc>,          // Publication/event date
    metadata: HashMap<String, String>  // Source-specific fields
}
```
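The listing above is schematic rather than compilable Rust. A minimal compilable mirror of it might look like the following. This is a sketch under assumptions: the `Domain` enum is cut down to three variants, `timestamp_unix` stands in for `DateTime<Utc>` to avoid a `chrono` dependency, and `make_vector` is a hypothetical helper.

```rust
use std::collections::HashMap;

// Reduced stand-in for the framework's Domain enum.
#[derive(Debug, Clone, PartialEq)]
enum Domain {
    News,
    Research,
    Medical,
}

#[derive(Debug, Clone)]
struct SemanticVector {
    id: String,                        // "SOURCE:identifier"
    embedding: Vec<f32>,               // 256 or 384 dimensions
    domain: Domain,
    timestamp_unix: i64,               // assumption: replaces DateTime<Utc>
    metadata: HashMap<String, String>, // source-specific fields
}

/// Builds a vector whose ID carries the source prefix described above.
fn make_vector(
    source: &str,
    identifier: &str,
    embedding: Vec<f32>,
    domain: Domain,
    timestamp_unix: i64,
) -> SemanticVector {
    SemanticVector {
        id: format!("{}:{}", source, identifier),
        embedding,
        domain,
        timestamp_unix,
        metadata: HashMap::new(),
    }
}
```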
### DataRecord Output (Wikipedia, Wikidata)
```rust
DataRecord {
    id: "source_identifier",
    source: "wikipedia|wikidata",
    record_type: "article|entity",
    timestamp: DateTime<Utc>,
    data: serde_json::Value,           // Full structured data
    embedding: Option<Vec<f32>>,       // Optional embeddings
    relationships: Vec<Relationship>   // Graph connections
}
```
---
## Domain Classification
### Domain::News
News API, HackerNews
### Domain::Social
Reddit, GitHub
### Domain::Research
ArXiv, Semantic Scholar, bioRxiv, medRxiv, CrossRef
### Domain::Economic
World Bank, FRED, Alpha Vantage, IMF
### Domain::Patent
USPTO, EPO, Google Patents
### Domain::Space
NASA APOD, SpaceX, SIMBAD
### Domain::Genomics
NCBI Gene, Ensembl, UniProt
### Domain::Protein
PDB
### Domain::Seismic
USGS Earthquake
### Domain::Ocean
Argo
### Domain::Physics
CERN Open Data, Materials Project
### Domain::Medical
PubMed, ClinicalTrials, FDA
---
## Error Handling
All clients implement:
### Retry Logic
- **Max Retries**: 3
- **Base Delay**: 1000ms
- **Backoff**: Labeled exponential, but the stated formula (delay × retry_count) grows linearly: 1000ms, 2000ms, 3000ms
- **Triggers**: Network errors, HTTP 429 (Too Many Requests)
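To make the retry schedule concrete, the sketch below computes the delays produced by the formula `delay × retry_count` and, for comparison only, a doubling schedule. The function names are illustrative and the code is not taken from the clients themselves.

```rust
/// Delay before retry number `retry_count` (1-based), using the
/// formula stated above: base delay multiplied by the retry count.
fn linear_delay_ms(base_ms: u64, retry_count: u32) -> u64 {
    base_ms * u64::from(retry_count)
}

/// A doubling schedule, shown only for comparison with the formula above.
fn doubling_delay_ms(base_ms: u64, retry_count: u32) -> u64 {
    base_ms * 2u64.pow(retry_count.saturating_sub(1))
}

/// Delays for retries 1..=max_retries with a 1000ms base delay.
fn schedule(max_retries: u32, delay: fn(u64, u32) -> u64) -> Vec<u64> {
    (1..=max_retries).map(|n| delay(1000, n)).collect()
}

/// Retry triggers listed above: a network error (no HTTP status) or HTTP 429.
fn is_retryable(status: Option<u16>) -> bool {
    matches!(status, None | Some(429))
}
```

With three retries the two schedules only diverge on the last attempt (3000ms versus 4000ms), so the difference matters mainly if the retry count is ever raised.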
### Error Types
```rust
FrameworkError::Network(reqwest::Error) // Connection issues
FrameworkError::Config(String) // Configuration/parsing errors
FrameworkError::Discovery(String) // Data not found
```
### Graceful Degradation
- Returns empty Vec on 404 (no results)
- Continues on partial failures in batch operations
- Logs warnings for rate limit hits
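The 404 rule can be pictured as a small status-to-result mapping. In this sketch `FetchError` is a local stand-in for `FrameworkError`, and reporting other non-success statuses as `Discovery` is an assumption rather than documented behavior.

```rust
// Local stand-in for the framework's error type.
#[derive(Debug, PartialEq)]
enum FetchError {
    Discovery(String),
}

/// Maps an HTTP status plus already-parsed records to a client result.
/// A 404 is treated as "no results", not as an error.
fn records_for_status(status: u16, parsed: Vec<String>) -> Result<Vec<String>, FetchError> {
    match status {
        200..=299 => Ok(parsed),
        404 => Ok(Vec::new()),
        other => Err(FetchError::Discovery(format!(
            "unexpected HTTP status {}",
            other
        ))),
    }
}
```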
---
## Embedding Configuration
### Standard (256 dimensions)
Used by: News, Social, Economic, Patent, Research, Space, Physics clients
- Good for general text, titles, abstracts
- Fast computation
- Lower memory footprint
### Enhanced (384 dimensions)
Used by: Medical clients (PubMed, ClinicalTrials, FDA)
- Richer semantic representation
- Better for technical/medical terminology
- Higher accuracy for domain-specific searches
### Implementation
```rust
// Deterministic hash-based embeddings: consistent across runs,
// no external model dependencies.
// The argument is `dimension: usize` (256 standard, 384 enhanced).
let embedder = SimpleEmbedder::new(256);
```
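How `SimpleEmbedder` hashes text is not shown here. As an illustration of what a deterministic hash-based embedding can look like, the sketch below hashes each whitespace-separated token with FNV-1a into one of `dimension` buckets and L2-normalizes the result. `HashEmbedder` is a hypothetical stand-in, and the tokenization and hash choice are assumptions.

```rust
/// 64-bit FNV-1a hash. The fixed constants make the output identical
/// on every run and platform.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hypothetical stand-in, not the framework's `SimpleEmbedder`.
struct HashEmbedder {
    dimension: usize,
}

impl HashEmbedder {
    fn new(dimension: usize) -> Self {
        assert!(dimension > 0, "dimension must be non-zero");
        Self { dimension }
    }

    /// Counts tokens per hash bucket, then scales to unit length.
    fn embed(&self, text: &str) -> Vec<f32> {
        let mut v = vec![0.0f32; self.dimension];
        for token in text.split_whitespace() {
            let h = fnv1a(token.to_lowercase().as_bytes());
            let idx = (h % self.dimension as u64) as usize;
            v[idx] += 1.0;
        }
        let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
        if norm > 0.0 {
            for x in v.iter_mut() {
                *x /= norm;
            }
        }
        v
    }
}
```

Because nothing in this scheme depends on random seeds or a model file, embedding the same text twice yields identical vectors, which is the property the section relies on.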
---
## Usage Patterns
### Single Source Query
```rust
let client = ArxivClient::new()?;
let papers = client.search("quantum computing", 50).await?;
```
### Multi-Source Aggregation
```rust
let (arxiv, s2, pubmed) = tokio::join!(
    arxiv_client.search(query, 50),
    s2_client.search_papers(query, 50),
    pubmed_client.search_articles(query, 50)
);
```
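`tokio::join!` waits for every future and returns a tuple of their outputs, so each of the three values above is presumably still a `Result` that has to be handled. One possible way to combine them while tolerating a failed source is sketched below. `merge_sources` is a hypothetical helper, and it assumes all sources yield the same item type.

```rust
/// Collects items from successful sources and records a message
/// for each source that failed, instead of aborting on the first error.
fn merge_sources<T, E: std::fmt::Display>(
    results: Vec<(&str, Result<Vec<T>, E>)>,
) -> (Vec<T>, Vec<String>) {
    let mut merged = Vec::new();
    let mut failures = Vec::new();
    for (source, result) in results {
        match result {
            Ok(mut items) => merged.append(&mut items),
            Err(err) => failures.push(format!("{}: {}", source, err)),
        }
    }
    (merged, failures)
}
```

Applied to the snippet above, the call would be along the lines of `merge_sources(vec![("arxiv", arxiv), ("s2", s2), ("pubmed", pubmed)])`.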
### Filtered Search
```rust
// ClinicalTrials by status
let trials = ct_client.search_trials("diabetes", Some("RECRUITING")).await?;
// ArXiv by category
let papers = arxiv_client.search_by_category("cs.AI", 100).await?;
// USGS by magnitude range
let quakes = usgs_client.get_by_magnitude_range(4.0, 6.0, 30).await?;
```
### Batch Retrieval
```rust
// PubMed: Fetch up to 200 abstracts per request
let pmids = vec!["12345678", "87654321" /* , ... */];
let abstracts = pubmed_client.fetch_abstracts(&pmids).await?;
```
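When there are more identifiers than fit in one request, they have to be split first. A minimal sketch, assuming the 200-per-request cap quoted above and a hypothetical helper name:

```rust
/// Splits identifiers into request-sized batches.
/// `max_per_request` must be non-zero (`chunks` panics on 0).
fn batch_ids(ids: &[String], max_per_request: usize) -> Vec<Vec<String>> {
    ids.chunks(max_per_request)
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

Each batch can then be passed to a call like `fetch_abstracts`, with the client's rate limit applied between batches.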
---
## Performance Tips
1. **Rate Limit Management**
- Use API keys when available (about 3x faster for NCBI, from 334ms to 100ms; 10x faster for Semantic Scholar, from 1000ms to 100ms)
- Batch requests when supported (PubMed, NCBI Gene)
- Parallel queries to independent sources
2. **Caching Strategy**
- Cache immutable data (historical papers, patents)
- Short TTL for dynamic data (news, social media)
- Store embeddings to avoid recomputation
3. **Query Optimization**
- Use specific filters to reduce result size
- Leverage ID lookups over full-text search when possible
- For knowledge graphs (Wikidata), use SPARQL for complex queries
4. **Resource Management**
- Reuse HTTP clients (already implemented via Arc)
- Consider connection pooling for high-volume usage
- Monitor rate limit headers (future enhancement)
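The caching advice in tip 2 can be sketched as a small in-memory cache with an optional time-to-live: `None` for immutable data such as historical papers, a short duration for news or social content. `TtlCache` is a hypothetical type, not part of the framework, and the clock is passed in explicitly to keep the behavior easy to test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// In-memory cache with an optional time-to-live per cache instance.
struct TtlCache<V> {
    ttl: Option<Duration>, // None = entries never expire
    entries: HashMap<String, (Instant, V)>,
}

impl<V> TtlCache<V> {
    fn new(ttl: Option<Duration>) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Stores a value together with the time it was inserted.
    fn insert_at(&mut self, key: &str, value: V, now: Instant) {
        self.entries.insert(key.to_string(), (now, value));
    }

    /// Returns the value unless it is missing or older than the TTL.
    fn get_at(&self, key: &str, now: Instant) -> Option<&V> {
        let (stored_at, value) = self.entries.get(key)?;
        match self.ttl {
            Some(ttl) if now.saturating_duration_since(*stored_at) >= ttl => None,
            _ => Some(value),
        }
    }
}
```

Storing embeddings as the cached value (for example `TtlCache<Vec<f32>>`) also covers the third point of that tip, avoiding recomputation.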
---
## Common Use Cases
### Academic Research
- **ArXiv + Semantic Scholar + CrossRef**: Comprehensive paper discovery
- **PubMed + bioRxiv**: Medical/biomedical research
- **NCBI Gene + Ensembl + UniProt**: Genomics research
### Market Intelligence
- **World Bank + FRED + IMF**: Macroeconomic analysis
- **Alpha Vantage**: Stock market data
- **USPTO + EPO**: Patent landscape analysis
### News Aggregation
- **News API**: Current events
- **Reddit + HackerNews**: Tech community discussions
- **GitHub**: Developer activity
### Scientific Data
- **USGS**: Earthquake monitoring
- **CERN**: Particle physics datasets
- **Materials Project**: Computational materials science
- **Argo**: Ocean climate data
### Knowledge Discovery
- **Wikipedia**: Structured articles with categories
- **Wikidata**: Entity relationships via SPARQL
- **Semantic Scholar**: Citation network analysis
---
## File Locations
| File | Clients | LOC |
|------|---------|-----|
| `api_clients.rs` | News, Reddit, GitHub, HackerNews | ~800 |
| `economic_clients.rs` | World Bank, FRED, Alpha Vantage, IMF | ~600 |
| `patent_clients.rs` | USPTO, EPO, Google Patents | ~500 |
| `arxiv_client.rs` | ArXiv | ~300 |
| `semantic_scholar.rs` | Semantic Scholar | ~400 |
| `biorxiv_client.rs` | bioRxiv, medRxiv | ~400 |
| `crossref_client.rs` | CrossRef | ~300 |
| `space_clients.rs` | NASA, SpaceX, SIMBAD | ~600 |
| `genomics_clients.rs` | NCBI Gene, Ensembl, UniProt, PDB | ~900 |
| `physics_clients.rs` | USGS, CERN, Argo, Materials Project | ~1200 |
| `wiki_clients.rs` | Wikipedia, Wikidata | ~900 |
| `medical_clients.rs` | PubMed, ClinicalTrials, FDA | ~900 |
**Total**: ~7,800 lines of client implementation code
---
## Next Steps
1. Review full inventory: `/home/user/ruvector/examples/data/framework/docs/API_CLIENTS_INVENTORY.md`
2. Check example usage: `/home/user/ruvector/examples/data/framework/examples/`
3. Run tests: `cargo test --features data-framework`
4. API key setup: Store in environment variables for optimal performance
---
**Generated**: 2026-01-04
**Framework Version**: RuVector Data Framework v0.1.0