Data Source Clients - Quick Reference

Summary Statistics

  • Total Clients: 32 across 12 modules
  • Total Public Methods: 150+
  • Domain Coverage: 10 (News, Social, Research, Economic, Patent, Space, Genomics, Physics, Medical, Knowledge)
  • Embedding Dimensions: 256 (standard), 384 (medical/scientific)


Client Index by Domain

News & Social (4 clients, 17 methods)

| Client | Endpoint | Auth | Rate Limit (min interval) | Methods |
|---|---|---|---|---|
| News API | newsapi.org | Required | 100ms | 4 |
| Reddit | reddit.com | Required | 1000ms | 5 |
| GitHub | github.com | Optional | 1000ms | 4 |
| HackerNews | hacker-news.firebase | None | 100ms | 4 |

Economic & Financial (4 clients, 12 methods)

| Client | Endpoint | Auth | Rate Limit (min interval) | Methods |
|---|---|---|---|---|
| World Bank | worldbank.org | None | 250ms | 3 |
| FRED | stlouisfed.org | Required | 200ms | 3 |
| Alpha Vantage | alphavantage.co | Required | 12000ms | 4 |
| IMF | imf.org | None | 500ms | 2 |

Patents (3 clients, 8 methods)

| Client | Endpoint | Auth | Rate Limit (min interval) | Methods |
|---|---|---|---|---|
| USPTO | uspto.gov | None | 500ms | 3 |
| EPO | ops.epo.org | Required | 1000ms | 3 |
| Google Patents | patents.google.com | None | 1000ms | 2 |

Research Papers (5 clients, 23 methods)

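| Client | Endpoint | Auth | Rate Limit (min interval) | Methods |
|---|---|---|---|---|
| ArXiv | arxiv.org | None | 3000ms | 4 |
| Semantic Scholar | semanticscholar.org | Optional | 1000ms / 100ms with key | 6 |
| bioRxiv | biorxiv.org | None | 500ms | 4 |
| medRxiv | medrxiv.org | None | 500ms | 4 |
| CrossRef | crossref.org | None | 200ms | 5 |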

Space & Astronomy (3 clients, 10 methods)

| Client | Endpoint | Auth | Rate Limit (min interval) | Methods |
|---|---|---|---|---|
| NASA APOD | api.nasa.gov | Optional | 1000ms | 3 |
| SpaceX | spacexdata.com | None | 500ms | 4 |
| SIMBAD | simbad.cds.unistra.fr | None | 1000ms | 3 |

Genomics & Proteomics (4 clients, 16 methods)

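| Client | Endpoint | Auth | Rate Limit (min interval) | Methods |
|---|---|---|---|---|
| NCBI Gene | ncbi.nlm.nih.gov | Optional | 334ms / 100ms with key | 4 |
| Ensembl | ensembl.org | None | 200ms | 5 |
| UniProt | uniprot.org | None | 200ms | 4 |
| PDB | rcsb.org | None | 500ms | 3 |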

Physics & Earth Science (4 clients, 15 methods)

| Client | Endpoint | Auth | Rate Limit (min interval) | Methods |
|---|---|---|---|---|
| USGS Earthquake | earthquake.usgs.gov | None | 200ms | 5 |
| CERN Open Data | opendata.cern.ch | None | 500ms | 3 |
| Argo Ocean | data-argo.ifremer.fr | None | 300ms | 4 |
| Materials Project | materialsproject.org | Required | 1000ms | 3 |

Knowledge Graphs (2 clients, 11 methods)

| Client | Endpoint | Auth | Rate Limit (min interval) | Methods |
|---|---|---|---|---|
| Wikipedia | wikipedia.org | None | 100ms | 4 |
| Wikidata | wikidata.org | None | 100ms | 7 |

Medical & Health (3 clients, 9 methods)

| Client | Endpoint | Auth | Rate Limit (min interval) | Methods |
|---|---|---|---|---|
| PubMed | ncbi.nlm.nih.gov | Optional | 334ms / 100ms with key | 4 |
| ClinicalTrials | clinicaltrials.gov | None | 100ms | 2 |
| FDA OpenFDA | fda.gov | None | 250ms | 3 |

Rate Limiting Quick Reference

Strictest Limits (Use Sparingly)

  • Alpha Vantage: 12000ms (5 req/min, 500/day)
  • ArXiv: 3000ms (1 req/3sec per guidelines)

Standard Limits (Typical Usage)

  • 1000ms: Reddit, GitHub, EPO, Google Patents, SIMBAD, NASA, Materials Project
  • 500ms: USPTO, bioRxiv, medRxiv, IMF, SpaceX, PDB, CERN

Fast Limits (High-Volume OK)

  • 100-200ms: News API, HackerNews, FRED, CrossRef, Ensembl, UniProt, USGS Earthquake, Wikipedia, Wikidata, ClinicalTrials
  • With API Key: NCBI Gene, PubMed, Semantic Scholar drop to 100ms
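
The intervals above are minimum gaps between consecutive requests. As a minimal sketch of how a caller might pace requests against such a gap (the `RateLimiter` type and its methods are hypothetical and are not taken from the framework), using only the standard library:

```rust
use std::time::{Duration, Instant};

/// Hypothetical pacing helper: enforces a minimum gap between requests.
pub struct RateLimiter {
    min_interval: Duration,
    last_request: Option<Instant>,
}

impl RateLimiter {
    /// `min_interval_ms` is the value from the tables, e.g. 3000 for ArXiv.
    pub fn new(min_interval_ms: u64) -> Self {
        RateLimiter {
            min_interval: Duration::from_millis(min_interval_ms),
            last_request: None,
        }
    }

    /// How long the caller should still wait at `now` before sending.
    pub fn wait_needed(&self, now: Instant) -> Duration {
        match self.last_request {
            None => Duration::ZERO,
            Some(last) => {
                let elapsed = now.saturating_duration_since(last);
                self.min_interval.saturating_sub(elapsed)
            }
        }
    }

    /// Record that a request was sent at `now`.
    pub fn record(&mut self, now: Instant) {
        self.last_request = Some(now);
    }
}

/// Requests per minute implied by a minimum interval.
/// Panics if `min_interval_ms` is zero.
pub fn requests_per_minute(min_interval_ms: u64) -> u64 {
    60_000 / min_interval_ms
}
```

Passing `now` in explicitly keeps the helper easy to test. A 12000ms interval works out to 5 requests per minute, which matches the Alpha Vantage figure above.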

Authentication Quick Reference

No Auth Required (21 clients)

HackerNews, World Bank, IMF, USPTO, Google Patents, ArXiv, bioRxiv, medRxiv, CrossRef, SpaceX, SIMBAD, Ensembl, UniProt, PDB, USGS, CERN, Argo, Wikipedia, Wikidata, ClinicalTrials, FDA

Optional Auth (Higher Limits) (5 clients)

GitHub, Semantic Scholar, NASA APOD, NCBI Gene, PubMed

Required Auth (6 clients)

News API, Reddit, FRED, Alpha Vantage, EPO, Materials Project
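
For the optional-auth clients, the presence of a key decides which interval applies. The sketch below shows one way to read a key from the environment and choose the interval. Both function names are hypothetical, and this document does not state which environment variable names the real clients use, so the caller passes the name in:

```rust
use std::env;
use std::time::Duration;

/// Read an API key from an environment variable, treating an unset or
/// blank value as "no key". The variable name is supplied by the caller
/// because the real names are an assumption here.
pub fn api_key_from_env(var_name: &str) -> Option<String> {
    env::var(var_name).ok().filter(|key| !key.trim().is_empty())
}

/// Pick the request interval for an optional-auth client,
/// e.g. 334ms without a key and 100ms with one for NCBI Gene and PubMed.
pub fn interval_for(api_key: Option<&str>, without_key_ms: u64, with_key_ms: u64) -> Duration {
    match api_key {
        Some(_) => Duration::from_millis(with_key_ms),
        None => Duration::from_millis(without_key_ms),
    }
}
```

A caller could combine the two as `interval_for(api_key_from_env("SOME_KEY_VAR").as_deref(), 334, 100)`, where `SOME_KEY_VAR` is a placeholder.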


Method Count by Category

Search Methods

  • Text Search: All clients support text-based search
  • ID Lookup: 22 clients support direct ID/identifier lookup
  • Advanced Filters: 18 clients support filtered searches (date, category, status, etc.)
  • Batch Operations: 4 clients (PubMed, NCBI Gene, ArXiv, Semantic Scholar)

Specialized Methods

  • Time-Series: World Bank, FRED, Alpha Vantage (economic data)
  • Geographic: USGS (earthquakes), Argo (ocean), SIMBAD (sky coordinates)
  • Graph Traversal: Semantic Scholar (citations/references), Wikipedia (categories/links), Wikidata (SPARQL)
  • Relationships: Wikipedia (15 avg links/article), Wikidata (structured claims)

Data Transformation Patterns

SemanticVector Output

SemanticVector {
    id: "SOURCE:identifier",      // Unique ID with source prefix
    embedding: Vec<f32>,           // 256 or 384 dimensions
    domain: Domain::*,             // News, Research, Medical, etc.
    timestamp: DateTime<Utc>,      // Publication/event date
    metadata: HashMap<String, String>  // Source-specific fields
}
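
The listing above is a schema outline rather than compilable Rust. A self-contained approximation might look like the following. It is a sketch under assumptions: `SystemTime` stands in for `DateTime<Utc>` so that no external crate is needed, the `Domain` enum lists only three variants, and the constructor with its dimension check is hypothetical:

```rust
use std::collections::HashMap;
use std::time::SystemTime;

/// Subset of the domain tags, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Domain {
    News,
    Research,
    Medical,
}

#[derive(Debug, Clone)]
pub struct SemanticVector {
    pub id: String,                        // "SOURCE:identifier"
    pub embedding: Vec<f32>,               // 256 or 384 dimensions
    pub domain: Domain,
    pub timestamp: SystemTime,             // stand-in for DateTime<Utc>
    pub metadata: HashMap<String, String>, // source-specific fields
}

impl SemanticVector {
    /// Hypothetical constructor: builds the prefixed ID and rejects
    /// embeddings that are neither 256 nor 384 dimensions long.
    pub fn new(
        source: &str,
        identifier: &str,
        embedding: Vec<f32>,
        domain: Domain,
        timestamp: SystemTime,
    ) -> Result<Self, String> {
        if embedding.len() != 256 && embedding.len() != 384 {
            return Err(format!("unexpected embedding dimension {}", embedding.len()));
        }
        Ok(SemanticVector {
            id: format!("{}:{}", source, identifier),
            embedding,
            domain,
            timestamp,
            metadata: HashMap::new(),
        })
    }
}
```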

DataRecord Output (Wikipedia, Wikidata)

DataRecord {
    id: "source_identifier",
    source: "wikipedia|wikidata",
    record_type: "article|entity",
    timestamp: DateTime<Utc>,
    data: serde_json::Value,       // Full structured data
    embedding: Option<Vec<f32>>,   // Optional embeddings
    relationships: Vec<Relationship>  // Graph connections
}

Domain Classification

Domain::News

News API, HackerNews

Domain::Social

Reddit, GitHub

Domain::Research

ArXiv, Semantic Scholar, bioRxiv, medRxiv, CrossRef

Domain::Economic

World Bank, FRED, Alpha Vantage, IMF

Domain::Patent

USPTO, EPO, Google Patents

Domain::Space

NASA APOD, SpaceX, SIMBAD

Domain::Genomics

NCBI Gene, Ensembl, UniProt

Domain::Protein

PDB

Domain::Seismic

USGS Earthquake

Domain::Ocean

Argo

Domain::Physics

CERN Open Data, Materials Project

Domain::Medical

PubMed, ClinicalTrials, FDA
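
The classification above can be written down as a lookup. This is only a sketch: the variant names follow the headings in this section, but the real enum may carry more variants or derive other traits, and `domain_for_client` is a hypothetical helper:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Domain {
    News,
    Social,
    Research,
    Economic,
    Patent,
    Space,
    Genomics,
    Protein,
    Seismic,
    Ocean,
    Physics,
    Medical,
}

/// Map a client name, as spelled in this section, to its domain tag.
/// Returns `None` for clients not listed here (for example Wikipedia and
/// Wikidata, which produce `DataRecord` output instead).
pub fn domain_for_client(client: &str) -> Option<Domain> {
    let domain = match client {
        "News API" | "HackerNews" => Domain::News,
        "Reddit" | "GitHub" => Domain::Social,
        "ArXiv" | "Semantic Scholar" | "bioRxiv" | "medRxiv" | "CrossRef" => Domain::Research,
        "World Bank" | "FRED" | "Alpha Vantage" | "IMF" => Domain::Economic,
        "USPTO" | "EPO" | "Google Patents" => Domain::Patent,
        "NASA APOD" | "SpaceX" | "SIMBAD" => Domain::Space,
        "NCBI Gene" | "Ensembl" | "UniProt" => Domain::Genomics,
        "PDB" => Domain::Protein,
        "USGS Earthquake" => Domain::Seismic,
        "Argo" => Domain::Ocean,
        "CERN Open Data" | "Materials Project" => Domain::Physics,
        "PubMed" | "ClinicalTrials" | "FDA" => Domain::Medical,
        _ => return None,
    };
    Some(domain)
}
```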


Error Handling

All clients implement:

Retry Logic

  • Max Retries: 3
  • Base Delay: 1000ms
  • Backoff: delay × retry_count (the wait grows linearly with each retry)
  • Triggers: Network errors, HTTP 429 (Too Many Requests)
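
Taking the stated formula (delay × retry_count) at face value, the retry parameters above could be sketched as follows. This is not the framework's implementation: `with_retries` is a hypothetical synchronous helper, the sleep function is injected so the loop can be exercised without waiting, and "3 retries" is assumed to mean three additional attempts after the first one:

```rust
use std::time::Duration;

pub const MAX_RETRIES: u32 = 3;
pub const BASE_DELAY: Duration = Duration::from_millis(1000);

/// Delay before retry number `retry_count` (1-based): base delay × retry_count.
pub fn retry_delay(retry_count: u32) -> Duration {
    BASE_DELAY * retry_count
}

/// Run `op`, retrying up to MAX_RETRIES times while `is_retryable` returns
/// true for the error (e.g. a network failure or an HTTP 429).
pub fn with_retries<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    is_retryable: impl Fn(&E) -> bool,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut retry_count = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if retry_count < MAX_RETRIES && is_retryable(&err) => {
                retry_count += 1;
                sleep(retry_delay(retry_count));
            }
            // Either the error is not retryable or the retries are used up.
            Err(err) => return Err(err),
        }
    }
}
```

In blocking code the `sleep` argument would typically be `std::thread::sleep`. An async client would await a timer instead.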

Error Types

FrameworkError::Network(reqwest::Error)  // Connection issues
FrameworkError::Config(String)           // Configuration/parsing errors
FrameworkError::Discovery(String)        // Data not found

Graceful Degradation

  • Returns empty Vec on 404 (no results)
  • Continues on partial failures in batch operations
  • Logs warnings for rate limit hits
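
The 404 rule can be illustrated with a small status-mapping function. The sketch below is an assumption-laden stand-in: `Network` wraps a `String` instead of `reqwest::Error` so that no external crate is needed, `items_or_empty` is a hypothetical helper, and mapping every other non-success status to `Network` is a guess rather than documented behavior:

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
pub enum FrameworkError {
    Network(String),   // stand-in for reqwest::Error
    Config(String),    // configuration/parsing errors
    Discovery(String), // data not found
}

impl fmt::Display for FrameworkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FrameworkError::Network(msg) => write!(f, "network error: {}", msg),
            FrameworkError::Config(msg) => write!(f, "config error: {}", msg),
            FrameworkError::Discovery(msg) => write!(f, "discovery error: {}", msg),
        }
    }
}

/// Turn an HTTP status and the already-parsed items into a client result:
/// success passes the items through, 404 degrades to an empty Vec.
pub fn items_or_empty(status: u16, items: Vec<String>) -> Result<Vec<String>, FrameworkError> {
    match status {
        200..=299 => Ok(items),
        404 => Ok(Vec::new()),
        other => Err(FrameworkError::Network(format!("unexpected HTTP status {}", other))),
    }
}
```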

Embedding Configuration

Standard (256 dimensions)

Used by: News, Social, Economic, Patent, Research, Space, Physics clients

  • Good for general text, titles, abstracts
  • Fast computation
  • Lower memory footprint

Enhanced (384 dimensions)

Used by: Medical clients (PubMed, ClinicalTrials, FDA)

  • Richer semantic representation
  • Better for technical/medical terminology
  • Higher accuracy for domain-specific searches

Implementation

SimpleEmbedder::new(dimension: usize)
// Deterministic hash-based embeddings
// Consistent across runs
// No external model dependencies
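
To make "deterministic hash-based embeddings" concrete, here is one possible scheme: hash each lowercased token with FNV-1a, count tokens per bucket, and L2-normalize. This document does not describe the internals of `SimpleEmbedder`, so `HashEmbedder` below is a hypothetical stand-in, not the framework's algorithm:

```rust
/// Hypothetical hash-based embedder; not the framework's SimpleEmbedder.
pub struct HashEmbedder {
    dimension: usize,
}

impl HashEmbedder {
    pub fn new(dimension: usize) -> Self {
        assert!(dimension > 0, "dimension must be non-zero");
        HashEmbedder { dimension }
    }

    /// Bucket-count the tokens of `text`, then scale to unit length.
    /// Text without tokens yields the all-zero vector.
    pub fn embed(&self, text: &str) -> Vec<f32> {
        let mut vector = vec![0.0f32; self.dimension];
        for token in text.split_whitespace() {
            let hash = fnv1a(token.to_lowercase().as_bytes());
            let bucket = (hash % self.dimension as u64) as usize;
            vector[bucket] += 1.0;
        }
        let norm = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
        if norm > 0.0 {
            for x in vector.iter_mut() {
                *x /= norm;
            }
        }
        vector
    }
}

/// 64-bit FNV-1a: a fixed, seedless hash, so results are stable across runs.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &byte in bytes {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}
```

A hand-written hash is used rather than `std::collections::hash_map::DefaultHasher` because the standard library does not promise that its algorithm stays the same between Rust releases.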

Usage Patterns

Single Source Query

let client = ArxivClient::new()?;
let papers = client.search("quantum computing", 50).await?;

Multi-Source Aggregation

let (arxiv, s2, pubmed) = tokio::join!(
    arxiv_client.search(query, 50),
    s2_client.search_papers(query, 50),
    pubmed_client.search_articles(query, 50)
);
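
`tokio::join!` yields one `Result` per source, so the caller still has to decide what to do when only some sources succeed. A minimal sketch of merging them while tolerating partial failure (`merge_sources` is a hypothetical helper, and it assumes every source returns the same record type):

```rust
/// Split per-source results into the merged records and the errors seen,
/// so that one failing source does not discard the others.
pub fn merge_sources<T, E>(results: Vec<Result<Vec<T>, E>>) -> (Vec<T>, Vec<E>) {
    let mut records = Vec::new();
    let mut errors = Vec::new();
    for result in results {
        match result {
            Ok(mut batch) => records.append(&mut batch),
            Err(err) => errors.push(err),
        }
    }
    (records, errors)
}
```

With the tuple from the example above, this could be called as `merge_sources(vec![arxiv, s2, pubmed])`, provided the three result types match.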

Filtered Search

// ClinicalTrials by status
let trials = ct_client.search_trials("diabetes", Some("RECRUITING")).await?;

// ArXiv by category
let papers = arxiv_client.search_by_category("cs.AI", 100).await?;

// USGS by magnitude range
let quakes = usgs_client.get_by_magnitude_range(4.0, 6.0, 30).await?;

Batch Retrieval

// PubMed: Fetch up to 200 abstracts per request
let pmids = vec!["12345678", "87654321" /* , ... */];
let abstracts = pubmed_client.fetch_abstracts(&pmids).await?;
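
When there are more IDs than fit in one request, the list has to be split first. The sketch below chunks IDs into groups of at most 200, the per-request figure quoted above. `id_batches` is a hypothetical helper, and joining each group with commas is an assumption about how the IDs are sent:

```rust
pub const MAX_IDS_PER_REQUEST: usize = 200;

/// Split `ids` into comma-joined groups of at most MAX_IDS_PER_REQUEST.
pub fn id_batches(ids: &[&str]) -> Vec<String> {
    ids.chunks(MAX_IDS_PER_REQUEST)
        .map(|chunk| chunk.join(","))
        .collect()
}
```

For example, 450 IDs produce three groups of 200, 200, and 50, each of which would be one request.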

Performance Tips

  1. Rate Limit Management

    • Use API keys when available (about 3x faster for NCBI at 334ms → 100ms, 10x for Semantic Scholar at 1000ms → 100ms)
    • Batch requests when supported (PubMed, NCBI Gene)
    • Parallel queries to independent sources
  2. Caching Strategy

    • Cache immutable data (historical papers, patents)
    • Short TTL for dynamic data (news, social media)
    • Store embeddings to avoid recomputation
  3. Query Optimization

    • Use specific filters to reduce result size
    • Leverage ID lookups over full-text search when possible
    • For knowledge graphs (Wikidata), use SPARQL for complex queries
  4. Resource Management

    • Reuse HTTP clients (already implemented via Arc)
    • Consider connection pooling for high-volume usage
    • Monitor rate limit headers (future enhancement)
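
The caching advice in tip 2 can be sketched as a small in-memory cache whose entries either expire after a TTL or never expire. `TtlCache` is a hypothetical type, not part of the framework, and the current time is passed in explicitly to keep the logic testable:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical cache: `ttl = None` suits immutable data (historical papers,
/// patents); a short `Some(ttl)` suits dynamic data (news, social media).
pub struct TtlCache<V> {
    ttl: Option<Duration>,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Option<Duration>) -> Self {
        TtlCache { ttl, entries: HashMap::new() }
    }

    /// Store `value` under `key`, stamped with the time it was cached.
    pub fn insert(&mut self, key: &str, value: V, now: Instant) {
        self.entries.insert(key.to_string(), (now, value));
    }

    /// Return the cached value unless it is missing or older than the TTL.
    pub fn get(&self, key: &str, now: Instant) -> Option<V> {
        let (stored_at, value) = self.entries.get(key)?;
        match self.ttl {
            Some(ttl) if now.saturating_duration_since(*stored_at) >= ttl => None,
            _ => Some(value.clone()),
        }
    }
}
```

Storing embeddings as the cached value (for example `TtlCache<Vec<f32>>`) also covers the "avoid recomputation" point.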

Common Use Cases

Academic Research

  • ArXiv + Semantic Scholar + CrossRef: Comprehensive paper discovery
  • PubMed + bioRxiv: Medical/biomedical research
  • NCBI Gene + Ensembl + UniProt: Genomics research

Market Intelligence

  • World Bank + FRED + IMF: Macroeconomic analysis
  • Alpha Vantage: Stock market data
  • USPTO + EPO: Patent landscape analysis

News Aggregation

  • News API: Current events
  • Reddit + HackerNews: Tech community discussions
  • GitHub: Developer activity

Scientific Data

  • USGS: Earthquake monitoring
  • CERN: Particle physics datasets
  • Materials Project: Computational materials science
  • Argo: Ocean climate data

Knowledge Discovery

  • Wikipedia: Structured articles with categories
  • Wikidata: Entity relationships via SPARQL
  • Semantic Scholar: Citation network analysis

File Locations

| File | Clients | LOC (approx.) |
|---|---|---|
| api_clients.rs | News, Reddit, GitHub, HackerNews | ~800 |
| economic_clients.rs | World Bank, FRED, Alpha Vantage, IMF | ~600 |
| patent_clients.rs | USPTO, EPO, Google Patents | ~500 |
| arxiv_client.rs | ArXiv | ~300 |
| semantic_scholar.rs | Semantic Scholar | ~400 |
| biorxiv_client.rs | bioRxiv, medRxiv | ~400 |
| crossref_client.rs | CrossRef | ~300 |
| space_clients.rs | NASA, SpaceX, SIMBAD | ~600 |
| genomics_clients.rs | NCBI Gene, Ensembl, UniProt, PDB | ~900 |
| physics_clients.rs | USGS, CERN, Argo, Materials Project | ~1200 |
| wiki_clients.rs | Wikipedia, Wikidata | ~900 |
| medical_clients.rs | PubMed, ClinicalTrials, FDA | ~900 |

Total: ~7,800 lines of client implementation code


Next Steps

  1. Review full inventory: /home/user/ruvector/examples/data/framework/docs/API_CLIENTS_INVENTORY.md
  2. Check example usage: /home/user/ruvector/examples/data/framework/examples/
  3. Run tests: cargo test --features data-framework
  4. API key setup: Store in environment variables for optimal performance

Generated: 2026-01-04
Framework Version: RuVector Data Framework v0.1.0