git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
9.9 KiB
News & Social Media API Clients
Comprehensive Rust client module for News & Social APIs, following TDD approach and RuVector patterns.
Overview
This module provides async clients for fetching data from news and social media APIs, converting responses into RuVector's DataRecord format with semantic embeddings.
Implemented Clients
1. HackerNewsClient
Base URL: https://hacker-news.firebaseio.com/v0
Features:
- ✅
get_top_stories(limit)- Top story IDs - ✅
get_new_stories(limit)- New stories - ✅
get_best_stories(limit)- Best stories - ✅
get_item(id)- Get story/comment by ID - ✅
get_user(username)- User profile
Authentication: None required
Rate Limits: Generous (no strict limits)
Status: ✅ Fully working with real data
use ruvector_data_framework::HackerNewsClient;
let client = HackerNewsClient::new()?;
let stories = client.get_top_stories(10).await?;
2. GuardianClient
Base URL: https://content.guardianapis.com
Features:
- ✅
search(query, limit)- Search articles - ✅
get_article(id)- Get article by ID - ✅
get_sections()- List sections - ✅
search_by_tag(tag, limit)- Tag-based search
Authentication: API key required (GUARDIAN_API_KEY)
Rate Limits: Free tier - 12 calls/sec, 5000/day
Mock Fallback: ✅ Synthetic data when no API key
Get API Key: https://open-platform.theguardian.com/
use ruvector_data_framework::GuardianClient;
let client = GuardianClient::new(Some("your_api_key".to_string()))?;
let articles = client.search("technology", 10).await?;
3. NewsDataClient
Base URL: https://newsdata.io/api/1
Features:
- ✅
get_latest(query, country, category)- Latest news - ✅
get_archive(query, from_date, to_date)- Historical news
Authentication: API key required (NEWSDATA_API_KEY)
Rate Limits: Free tier - 200 requests/day
Mock Fallback: ✅ Synthetic data when no API key
Get API Key: https://newsdata.io/
use ruvector_data_framework::NewsDataClient;
let client = NewsDataClient::new(Some("your_api_key".to_string()))?;
let news = client.get_latest(Some("AI"), Some("us"), Some("technology")).await?;
4. RedditClient
Base URL: https://www.reddit.com (JSON endpoints)
Features:
- ✅
get_subreddit_posts(subreddit, sort, limit)- Subreddit posts - ✅
get_post_comments(post_id)- Post comments - ✅
search(query, subreddit, limit)- Search posts
Authentication: None (uses public .json endpoints)
Rate Limits: Be respectful (1 req/sec implemented)
Special Handling: ✅ Reddit's .json suffix pattern
use ruvector_data_framework::RedditClient;
let client = RedditClient::new()?;
let posts = client.get_subreddit_posts("programming", "hot", 10).await?;
Architecture
Data Flow
API Response → Deserialize → Convert to DataRecord → Generate Embedding → Return
Key Components
- Response Structures: Serde deserialization for API JSON responses
- Conversion Methods:
*_to_record()methods convert API data toDataRecord - Embedding Generation: Uses
SimpleEmbedder(128-dim bag-of-words) - Retry Logic: Exponential backoff with 3 max retries
- Rate Limiting: Client-specific delays to respect API limits
DataRecord Structure
pub struct DataRecord {
pub id: String, // Unique ID
pub source: String, // "hackernews", "guardian", etc.
pub record_type: String, // "story", "article", "post", etc.
pub timestamp: DateTime<Utc>, // Publication time
pub data: serde_json::Value, // Raw data
pub embedding: Option<Vec<f32>>, // 128-dim semantic vector
pub relationships: Vec<Relationship>, // Graph relationships
}
Testing
Test Coverage
- ✅ 16 comprehensive tests (all passing)
- Client creation tests
- Conversion function tests
- Synthetic data generation tests
- Embedding normalization tests
- Timestamp parsing tests
Run Tests
# Run all news_clients tests
cargo test news_clients --lib
# Run specific test
cargo test news_clients::tests::test_hackernews_item_conversion
# Run with output
cargo test news_clients --lib -- --nocapture
Test Results
test result: ok. 16 passed; 0 failed; 0 ignored
Demo Example
Run the comprehensive demo:
# Basic demo (uses HackerNews without auth)
cargo run --example news_social_demo
# With API keys
GUARDIAN_API_KEY=your_key \
NEWSDATA_API_KEY=your_key \
cargo run --example news_social_demo
Demo Output:
- Fetches top HackerNews stories
- Searches Guardian articles
- Gets latest NewsData news
- Retrieves Reddit posts
- Shows embedding information
Implementation Patterns
Following api_clients.rs Patterns
✅ Async/await with tokio
pub async fn get_top_stories(&self, limit: usize) -> Result<Vec<DataRecord>>
✅ Retry logic with exponential backoff
async fn fetch_with_retry(&self, url: &str) -> Result<reqwest::Response> {
let mut retries = 0;
loop {
match self.client.get(url).send().await {
Ok(response) if response.status() == StatusCode::TOO_MANY_REQUESTS => {
retries += 1;
sleep(Duration::from_millis(RETRY_DELAY_MS * retries as u64)).await;
}
Ok(response) => return Ok(response),
Err(_) if retries < MAX_RETRIES => {
retries += 1;
sleep(Duration::from_millis(RETRY_DELAY_MS * retries as u64)).await;
}
Err(e) => return Err(FrameworkError::Network(e)),
}
}
}
✅ Mock fallback for API key clients
if self.api_key.is_none() {
return Ok(self.generate_synthetic_articles(query, limit)?);
}
✅ Timestamp parsing
// Unix timestamp (HackerNews, Reddit)
let timestamp = DateTime::from_timestamp(unix_time, 0).unwrap_or_else(Utc::now);
// RFC3339 (Guardian)
let timestamp = DateTime::parse_from_rfc3339(&date_string)
.map(|dt| dt.with_timezone(&Utc))
.unwrap_or_else(|_| Utc::now());
// Custom format (NewsData)
let timestamp = NaiveDateTime::parse_from_str(d, "%Y-%m-%d %H:%M:%S")
.ok()
.map(|ndt| ndt.and_utc())
.unwrap_or_else(Utc::now);
✅ DataSource trait implementation
#[async_trait]
impl DataSource for HackerNewsClient {
fn source_id(&self) -> &str {
"hackernews"
}
async fn fetch_batch(
&self,
_cursor: Option<String>,
batch_size: usize,
) -> Result<(Vec<DataRecord>, Option<String>)> {
let records = self.get_top_stories(batch_size).await?;
Ok((records, None))
}
async fn total_count(&self) -> Result<Option<u64>> {
Ok(None)
}
async fn health_check(&self) -> Result<bool> {
let response = self.client.get(format!("{}/maxitem.json", self.base_url)).send().await?;
Ok(response.status().is_success())
}
}
Special Implementations
Reddit .json Pattern
Reddit's public API uses .json suffix:
let url = format!("{}/r/{}/{}.json?limit={}",
self.base_url, // https://www.reddit.com
subreddit, // "programming"
sort, // "hot"
limit // 25
);
// Results in: https://www.reddit.com/r/programming/hot.json?limit=25
Guardian Tag Relationships
Creates graph relationships for tags:
if let Some(tags) = article.tags {
for tag in tags {
relationships.push(Relationship {
target_id: format!("guardian_tag_{}", tag.id),
rel_type: "has_tag".to_string(),
weight: 1.0,
properties: {
let mut props = HashMap::new();
props.insert("tag_type".to_string(), serde_json::json!(tag.tag_type));
props.insert("tag_title".to_string(), serde_json::json!(tag.web_title));
props
},
});
}
}
HackerNews Relationships
Creates author and comment relationships:
// Author relationship
if let Some(author) = &item.by {
relationships.push(Relationship {
target_id: format!("hn_user_{}", author),
rel_type: "authored_by".to_string(),
weight: 1.0,
properties: HashMap::new(),
});
}
// Comment relationships
for &kid_id in &item.kids {
relationships.push(Relationship {
target_id: format!("hn_item_{}", kid_id),
rel_type: "has_comment".to_string(),
weight: 1.0,
properties: HashMap::new(),
});
}
Error Handling
All clients use the framework's Result type:
pub type Result<T> = std::result::Result<T, FrameworkError>;
pub enum FrameworkError {
Ingestion(String),
Coherence(String),
Discovery(String),
Network(#[from] reqwest::Error),
Serialization(#[from] serde_json::Error),
Graph(String),
Config(String),
}
Rate Limiting
Each client respects API limits:
| Client | Rate Limit | Implementation |
|---|---|---|
| HackerNews | Generous | 100ms delay |
| Guardian | 12/sec, 5000/day | 100ms delay |
| NewsData | 200/day | 500ms delay |
| Be respectful | 1000ms delay |
Future Enhancements
Potential improvements:
- Twitter/X API integration
- Mastodon API client
- Discord message fetching
- Telegram channel scraping
- Advanced rate limit handling with token buckets
- Caching layer for repeated requests
- Streaming updates for real-time feeds
- Sentiment analysis integration
- Topic modeling on aggregated news
Contributing
When adding new news/social clients:
- Follow the patterns in
api_clients.rs - Implement
DataSourcetrait - Add comprehensive tests
- Generate embeddings for all text content
- Create relationships where applicable
- Handle timestamps correctly
- Implement retry logic
- Add mock/synthetic data fallback for API key clients
License
Part of RuVector data discovery framework.