Data Source Clients - Quick Reference
Summary Statistics
- Total Clients: 32 across 12 modules
- Total Public Methods: 150+
- Domain Coverage: 10 (News, Social, Research, Economic, Patent, Space, Genomics, Physics, Medical, Knowledge)
- Embedding Dimensions: 256 (standard), 384 (medical/scientific)
Client Index by Domain
News & Social (4 clients, 17 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|---|---|---|---|---|
| News API | newsapi.org | Required | 100ms | 4 |
| Reddit | reddit.com | Required | 1000ms | 5 |
| GitHub | github.com | Optional | 1000ms | 4 |
| HackerNews | hacker-news.firebase | None | 100ms | 4 |
Economic & Financial (4 clients, 12 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|---|---|---|---|---|
| World Bank | worldbank.org | None | 250ms | 3 |
| FRED | stlouisfed.org | Required | 200ms | 3 |
| Alpha Vantage | alphavantage.co | Required | 12000ms | 4 |
| IMF | imf.org | None | 500ms | 2 |
Patents (3 clients, 8 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|---|---|---|---|---|
| USPTO | uspto.gov | None | 500ms | 3 |
| EPO | ops.epo.org | Required | 1000ms | 3 |
| Google Patents | patents.google.com | None | 1000ms | 2 |
Research Papers (5 clients, 23 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|---|---|---|---|---|
| ArXiv | arxiv.org | None | 3000ms | 4 |
| Semantic Scholar | semanticscholar.org | Optional | 1000ms/100ms | 6 |
| bioRxiv | biorxiv.org | None | 500ms | 4 |
| medRxiv | medrxiv.org | None | 500ms | 4 |
| CrossRef | crossref.org | None | 200ms | 5 |
Space & Astronomy (3 clients, 10 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|---|---|---|---|---|
| NASA APOD | api.nasa.gov | Optional | 1000ms | 3 |
| SpaceX | spacexdata.com | None | 500ms | 4 |
| SIMBAD | simbad.cds.unistra.fr | None | 1000ms | 3 |
Genomics & Proteomics (4 clients, 16 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|---|---|---|---|---|
| NCBI Gene | ncbi.nlm.nih.gov | Optional | 334ms/100ms | 4 |
| Ensembl | ensembl.org | None | 200ms | 5 |
| UniProt | uniprot.org | None | 200ms | 4 |
| PDB | rcsb.org | None | 500ms | 3 |
Physics & Earth Science (4 clients, 15 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|---|---|---|---|---|
| USGS Earthquake | earthquake.usgs.gov | None | 200ms | 5 |
| CERN Open Data | opendata.cern.ch | None | 500ms | 3 |
| Argo Ocean | data-argo.ifremer.fr | None | 300ms | 4 |
| Materials Project | materialsproject.org | Required | 1000ms | 3 |
Knowledge Graphs (2 clients, 11 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|---|---|---|---|---|
| Wikipedia | wikipedia.org | None | 100ms | 4 |
| Wikidata | wikidata.org | None | 100ms | 7 |
Medical & Health (3 clients, 9 methods)
| Client | Endpoint | Auth | Rate Limit | Methods |
|---|---|---|---|---|
| PubMed | ncbi.nlm.nih.gov | Optional | 334ms/100ms | 4 |
| ClinicalTrials | clinicaltrials.gov | None | 100ms | 2 |
| FDA OpenFDA | fda.gov | None | 250ms | 3 |
Rate Limiting Quick Reference
Strictest Limits (Use Sparingly)
- Alpha Vantage: 12000ms (5 req/min, 500/day)
- ArXiv: 3000ms (1 req/3sec per guidelines)
Standard Limits (Typical Usage)
- 1000ms: Reddit, GitHub, EPO, Google Patents, SIMBAD, NASA, Materials Project
- 500ms: USPTO, bioRxiv, medRxiv, IMF, SpaceX, PDB, CERN
Fast Limits (High-Volume OK)
- 100-200ms: News API, HackerNews, FRED, CrossRef, Ensembl, UniProt, Wikipedia, Wikidata, ClinicalTrials
- With API Key: NCBI Gene, PubMed, Semantic Scholar drop to 100ms
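The intervals above can be enforced with a small per-client limiter. The following is a minimal sketch using only the standard library; `MinIntervalLimiter` and `requests_per_minute` are hypothetical names for illustration and are not taken from the framework's code.

```rust
use std::time::{Duration, Instant};

/// Enforces a minimum interval between consecutive requests to one API.
pub struct MinIntervalLimiter {
    interval: Duration,
    last_request: Option<Instant>,
}

impl MinIntervalLimiter {
    pub fn new(interval_ms: u64) -> Self {
        Self {
            interval: Duration::from_millis(interval_ms),
            last_request: None,
        }
    }

    /// How long the caller should wait at `now` before sending a request.
    pub fn wait_time(&self, now: Instant) -> Duration {
        match self.last_request {
            Some(last) => self
                .interval
                .saturating_sub(now.saturating_duration_since(last)),
            None => Duration::ZERO,
        }
    }

    /// Records that a request was sent at `now`.
    pub fn record(&mut self, now: Instant) {
        self.last_request = Some(now);
    }
}

/// Converts a per-request interval into a requests-per-minute budget.
pub fn requests_per_minute(interval_ms: u64) -> u64 {
    60_000 / interval_ms.max(1)
}
```

With the Alpha Vantage interval of 12000ms, `requests_per_minute` gives 5, which matches the 5 req/min figure listed above.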
Authentication Quick Reference
No Auth Required (21 clients)
HackerNews, World Bank, IMF, USPTO, Google Patents, ArXiv, bioRxiv, medRxiv, CrossRef, SpaceX, SIMBAD, Ensembl, UniProt, PDB, USGS, CERN, Argo, Wikipedia, Wikidata, ClinicalTrials, FDA
Optional Auth (Higher Limits) (5 clients)
GitHub, Semantic Scholar, NASA APOD, NCBI Gene, PubMed
Required Auth (6 clients)
News API, Reddit, FRED, Alpha Vantage, EPO, Materials Project
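For optional-auth clients, the presence of a key decides which interval applies. A minimal sketch, assuming the key is read from an environment variable; the variable name `NCBI_API_KEY` and both function names are hypothetical, while the 334ms and 100ms values come from the NCBI rows in the tables above.

```rust
use std::env;
use std::time::Duration;

/// Picks the request interval for an optional-auth client (NCBI figures).
pub fn ncbi_interval(api_key: Option<&str>) -> Duration {
    match api_key {
        // A non-blank key unlocks the faster interval.
        Some(key) if !key.trim().is_empty() => Duration::from_millis(100),
        // No key, or a blank one: fall back to the unauthenticated interval.
        _ => Duration::from_millis(334),
    }
}

/// Reads the key from the environment.
/// `NCBI_API_KEY` is an assumed variable name, not one the framework defines.
pub fn ncbi_key_from_env() -> Option<String> {
    env::var("NCBI_API_KEY")
        .ok()
        .filter(|key| !key.trim().is_empty())
}
```

A caller might combine the two as `ncbi_interval(ncbi_key_from_env().as_deref())`.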
Method Count by Category
Search Methods
- Text Search: All 32 clients support text-based search
- ID Lookup: 22 clients support direct ID/identifier lookup
- Advanced Filters: 18 clients support filtered searches (date, category, status, etc.)
- Batch Operations: 4 clients (PubMed, NCBI Gene, ArXiv, Semantic Scholar)
Specialized Methods
- Time-Series: World Bank, FRED, Alpha Vantage (economic data)
- Geographic: USGS (earthquakes), Argo (ocean), SIMBAD (sky coordinates)
- Graph Traversal: Semantic Scholar (citations/references), Wikipedia (categories/links), Wikidata (SPARQL)
- Relationships: Wikipedia (15 avg links/article), Wikidata (structured claims)
Data Transformation Patterns
SemanticVector Output
SemanticVector {
    id: "SOURCE:identifier",            // Unique ID with source prefix
    embedding: Vec<f32>,                // 256 or 384 dimensions
    domain: Domain::*,                  // News, Research, Medical, etc.
    timestamp: DateTime<Utc>,           // Publication/event date
    metadata: HashMap<String, String>   // Source-specific fields
}
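The source-prefixed ID and the two permitted embedding sizes can be handled with a few helpers. This is a sketch only; `make_vector_id`, `split_vector_id`, and `is_supported_dimension` are hypothetical names, and the exact casing the framework uses for the source prefix is not specified here.

```rust
/// Builds an ID in the "SOURCE:identifier" form.
pub fn make_vector_id(source: &str, identifier: &str) -> String {
    format!("{source}:{identifier}")
}

/// Splits an ID at the first colon, so identifiers may themselves contain colons.
pub fn split_vector_id(id: &str) -> Option<(&str, &str)> {
    id.split_once(':')
}

/// Checks an embedding length against the two sizes listed above.
pub fn is_supported_dimension(len: usize) -> bool {
    matches!(len, 256 | 384)
}
```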
DataRecord Output (Wikipedia, Wikidata)
DataRecord {
    id: "source_identifier",
    source: "wikipedia|wikidata",
    record_type: "article|entity",
    timestamp: DateTime<Utc>,
    data: serde_json::Value,            // Full structured data
    embedding: Option<Vec<f32>>,        // Optional embeddings
    relationships: Vec<Relationship>    // Graph connections
}
Domain Classification
Domain::News
News API, HackerNews
Domain::Social
Reddit, GitHub
Domain::Research
ArXiv, Semantic Scholar, bioRxiv, medRxiv, CrossRef
Domain::Economic
World Bank, FRED, Alpha Vantage, IMF
Domain::Patent
USPTO, EPO, Google Patents
Domain::Space
NASA APOD, SpaceX, SIMBAD
Domain::Genomics
NCBI Gene, Ensembl, UniProt
Domain::Protein
PDB
Domain::Seismic
USGS Earthquake
Domain::Ocean
Argo
Domain::Physics
CERN Open Data, Materials Project
Domain::Medical
PubMed, ClinicalTrials, FDA
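The classification above can be written as a lookup from client name to domain. A minimal sketch: the variant names follow the list above, but the enum definition and the `domain_for_client` function are illustrative and may differ from the framework's actual types.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Domain {
    News, Social, Research, Economic, Patent, Space,
    Genomics, Protein, Seismic, Ocean, Physics, Medical,
}

/// Returns the domain for a client name as listed above, or None if unlisted.
pub fn domain_for_client(client: &str) -> Option<Domain> {
    let domain = match client {
        "News API" | "HackerNews" => Domain::News,
        "Reddit" | "GitHub" => Domain::Social,
        "ArXiv" | "Semantic Scholar" | "bioRxiv" | "medRxiv" | "CrossRef" => Domain::Research,
        "World Bank" | "FRED" | "Alpha Vantage" | "IMF" => Domain::Economic,
        "USPTO" | "EPO" | "Google Patents" => Domain::Patent,
        "NASA APOD" | "SpaceX" | "SIMBAD" => Domain::Space,
        "NCBI Gene" | "Ensembl" | "UniProt" => Domain::Genomics,
        "PDB" => Domain::Protein,
        "USGS Earthquake" => Domain::Seismic,
        "Argo" => Domain::Ocean,
        "CERN Open Data" | "Materials Project" => Domain::Physics,
        "PubMed" | "ClinicalTrials" | "FDA" => Domain::Medical,
        _ => return None,
    };
    Some(domain)
}
```

Wikipedia and Wikidata do not appear in the list above, so this lookup returns `None` for them.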
Error Handling
All clients implement:
Retry Logic
- Max Retries: 3
- Base Delay: 1000ms
- Backoff: Linear (delay × retry_count), giving waits of 1000ms, 2000ms, 3000ms
- Triggers: Network errors, HTTP 429 (Too Many Requests)
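The retry parameters above can be sketched as a generic helper. This is an illustration under stated assumptions, not the framework's implementation: `retry_with_backoff` and `backoff_delay` are hypothetical names, and the sleep function is passed in so the caller can choose a blocking or async-aware wait.

```rust
use std::time::Duration;

pub const MAX_RETRIES: u32 = 3;
pub const BASE_DELAY_MS: u64 = 1000;

/// Delay before the given retry (1-based), using the delay × retry_count rule.
pub fn backoff_delay(retry_count: u32) -> Duration {
    Duration::from_millis(BASE_DELAY_MS * u64::from(retry_count))
}

/// Runs `op`, retrying up to MAX_RETRIES times while `is_retryable` returns true.
pub fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    is_retryable: impl Fn(&E) -> bool,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut retries = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Retry only for errors the caller marks as transient (e.g. HTTP 429).
            Err(err) if retries < MAX_RETRIES && is_retryable(&err) => {
                retries += 1;
                sleep(backoff_delay(retries));
            }
            // Non-retryable error, or retries exhausted.
            Err(err) => return Err(err),
        }
    }
}
```

In blocking code, `std::thread::sleep` can be passed as the `sleep` argument.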
Error Types
FrameworkError::Network(reqwest::Error) // Connection issues
FrameworkError::Config(String) // Configuration/parsing errors
FrameworkError::Discovery(String) // Data not found
Graceful Degradation
- Returns empty Vec on 404 (no results)
- Continues on partial failures in batch operations
- Logs warnings for rate limit hits
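The degradation rules above might look like the following in code. This is a simplified sketch: the `Network` variant here holds a `String` as a stand-in for `reqwest::Error`, the mapping of non-404 failures to `Network` is an assumption, and `interpret_response` and `collect_partial` are hypothetical helper names.

```rust
#[derive(Debug, PartialEq)]
pub enum FrameworkError {
    Network(String),   // stand-in: the documented variant wraps reqwest::Error
    Config(String),    // configuration/parsing errors
    Discovery(String), // data not found
}

/// Treats 404 as "no results" and other non-success statuses as errors.
pub fn interpret_response(
    status: u16,
    items: Vec<String>,
) -> Result<Vec<String>, FrameworkError> {
    match status {
        200..=299 => Ok(items),
        404 => Ok(Vec::new()),
        other => Err(FrameworkError::Network(format!(
            "unexpected HTTP status {other}"
        ))),
    }
}

/// Keeps results from successful batches and counts the failed ones.
pub fn collect_partial(
    batches: Vec<Result<Vec<String>, FrameworkError>>,
) -> (Vec<String>, usize) {
    let mut items = Vec::new();
    let mut failures = 0;
    for batch in batches {
        match batch {
            Ok(mut found) => items.append(&mut found),
            Err(err) => {
                failures += 1;
                eprintln!("warning: batch failed: {err:?}");
            }
        }
    }
    (items, failures)
}
```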
Embedding Configuration
Standard (256 dimensions)
Used by: News, Social, Economic, Patent, Research, Space, Physics clients
- Good for general text, titles, abstracts
- Fast computation
- Lower memory footprint
Enhanced (384 dimensions)
Used by: Medical clients (PubMed, ClinicalTrials, FDA)
- Richer semantic representation
- Better for technical/medical terminology
- Higher accuracy for domain-specific searches
Implementation
SimpleEmbedder::new(dimension: usize)
// Deterministic hash-based embeddings
// Consistent across runs
// No external model dependencies
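To illustrate what a deterministic hash-based embedding can look like, here is a minimal sketch using token hashing with FNV-1a. It is not the actual `SimpleEmbedder` algorithm, which is not described here; the `embed` function and its sign-bit trick are assumptions chosen for illustration.

```rust
/// FNV-1a 64-bit hash, written out so results do not depend on std's hasher.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in bytes {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hashes each lowercased token into a bucket, then L2-normalizes the vector.
pub fn embed(text: &str, dimension: usize) -> Vec<f32> {
    let mut vector = vec![0.0f32; dimension];
    if dimension == 0 {
        return vector;
    }
    for token in text.split_whitespace() {
        let hash = fnv1a(token.to_lowercase().as_bytes());
        let index = (hash % dimension as u64) as usize;
        // Use the top hash bit as a sign so bucket collisions can partly cancel.
        let sign = if hash >> 63 == 0 { 1.0 } else { -1.0 };
        vector[index] += sign;
    }
    let norm = vector.iter().map(|v| v * v).sum::<f32>().sqrt();
    if norm > 0.0 {
        for value in &mut vector {
            *value /= norm;
        }
    }
    vector
}
```

Because the hash is fixed, the same text yields the same vector on every run and platform, which is the property the comments above describe.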
Usage Patterns
Single Source Query
let client = ArxivClient::new()?;
let papers = client.search("quantum computing", 50).await?;
Multi-Source Aggregation
let (arxiv, s2, pubmed) = tokio::join!(
    arxiv_client.search(query, 50),
    s2_client.search_papers(query, 50),
    pubmed_client.search_articles(query, 50)
);
Filtered Search
// ClinicalTrials by status
let trials = ct_client.search_trials("diabetes", Some("RECRUITING")).await?;
// ArXiv by category
let papers = arxiv_client.search_by_category("cs.AI", 100).await?;
// USGS by magnitude range
let quakes = usgs_client.get_by_magnitude_range(4.0, 6.0, 30).await?;
Batch Retrieval
// PubMed: Fetch up to 200 abstracts per request
let pmids = vec!["12345678", "87654321"]; // add further PMIDs as needed
let abstracts = pubmed_client.fetch_abstracts(&pmids).await?;
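When the ID list is longer than the per-request limit, it can be split into batches first. A minimal sketch, assuming the 200-ID limit from the comment above; `batch_ids` and `MAX_IDS_PER_REQUEST` are hypothetical names, not part of the client API.

```rust
/// Maximum IDs per request, taken from the PubMed comment above.
pub const MAX_IDS_PER_REQUEST: usize = 200;

/// Splits a list of IDs into request-sized batches, preserving order.
pub fn batch_ids(ids: &[String]) -> Vec<Vec<String>> {
    ids.chunks(MAX_IDS_PER_REQUEST)
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

Each batch can then be passed to the fetch call in turn, with the rate limit applied between requests.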
Performance Tips
- Rate Limit Management
  - Use API keys when available (Semantic Scholar drops from 1000ms to 100ms, NCBI from 334ms to 100ms)
  - Batch requests when supported (PubMed, NCBI Gene)
  - Run parallel queries against independent sources
- Caching Strategy
  - Cache immutable data (historical papers, patents)
  - Use a short TTL for dynamic data (news, social media)
  - Store embeddings to avoid recomputation
- Query Optimization
  - Use specific filters to reduce result size
  - Prefer ID lookups over full-text search when possible
  - For knowledge graphs (Wikidata), use SPARQL for complex queries
- Resource Management
  - Reuse HTTP clients (already implemented via Arc)
  - Consider connection pooling for high-volume usage
  - Monitor rate limit headers (future enhancement)
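The caching advice above (no expiry for immutable data, short TTL for dynamic data) can be sketched with a small in-memory cache. `TtlCache` is a hypothetical type for illustration; the framework is not stated to ship one. The current time is passed in explicitly to keep the behavior easy to test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// A small in-memory cache. `ttl = None` means entries never expire.
pub struct TtlCache<V> {
    ttl: Option<Duration>,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Option<Duration>) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Stores `value` under `key`, stamped with the time `now`.
    pub fn insert(&mut self, key: &str, value: V, now: Instant) {
        self.entries.insert(key.to_string(), (now, value));
    }

    /// Returns the value if present and not older than the TTL at `now`.
    pub fn get(&self, key: &str, now: Instant) -> Option<V> {
        let (stored_at, value) = self.entries.get(key)?;
        match self.ttl {
            Some(ttl) if now.saturating_duration_since(*stored_at) >= ttl => None,
            _ => Some(value.clone()),
        }
    }
}
```

A cache for papers or patents would use `TtlCache::new(None)`, while one for news might use a TTL of a few minutes.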
Common Use Cases
Academic Research
- ArXiv + Semantic Scholar + CrossRef: Comprehensive paper discovery
- PubMed + bioRxiv: Medical/biomedical research
- NCBI Gene + Ensembl + UniProt: Genomics research
Market Intelligence
- World Bank + FRED + IMF: Macroeconomic analysis
- Alpha Vantage: Stock market data
- USPTO + EPO: Patent landscape analysis
News Aggregation
- News API: Current events
- Reddit + HackerNews: Tech community discussions
- GitHub: Developer activity
Scientific Data
- USGS: Earthquake monitoring
- CERN: Particle physics datasets
- Materials Project: Computational materials science
- Argo: Ocean climate data
Knowledge Discovery
- Wikipedia: Structured articles with categories
- Wikidata: Entity relationships via SPARQL
- Semantic Scholar: Citation network analysis
File Locations
| File | Clients | LOC |
|---|---|---|
| api_clients.rs | News, Reddit, GitHub, HackerNews | ~800 |
| economic_clients.rs | World Bank, FRED, Alpha Vantage, IMF | ~600 |
| patent_clients.rs | USPTO, EPO, Google Patents | ~500 |
| arxiv_client.rs | ArXiv | ~300 |
| semantic_scholar.rs | Semantic Scholar | ~400 |
| biorxiv_client.rs | bioRxiv, medRxiv | ~400 |
| crossref_client.rs | CrossRef | ~300 |
| space_clients.rs | NASA, SpaceX, SIMBAD | ~600 |
| genomics_clients.rs | NCBI Gene, Ensembl, UniProt, PDB | ~900 |
| physics_clients.rs | USGS, CERN, Argo, Materials Project | ~1200 |
| wiki_clients.rs | Wikipedia, Wikidata | ~900 |
| medical_clients.rs | PubMed, ClinicalTrials, FDA | ~900 |
Total: ~7,800 lines of client implementation code
Next Steps
- Review full inventory: /home/user/ruvector/examples/data/framework/docs/API_CLIENTS_INVENTORY.md
- Check example usage: /home/user/ruvector/examples/data/framework/examples/
- Run tests: cargo test --features data-framework
- API key setup: Store in environment variables for optimal performance
Generated: 2026-01-04
Framework Version: RuVector Data Framework v0.1.0