# News & Social Media API Clients Comprehensive Rust client module for News & Social APIs, following TDD approach and RuVector patterns. ## Overview This module provides async clients for fetching data from news and social media APIs, converting responses into RuVector's `DataRecord` format with semantic embeddings. ## Implemented Clients ### 1. HackerNewsClient **Base URL**: `https://hacker-news.firebaseio.com/v0` **Features**: - ✅ `get_top_stories(limit)` - Top story IDs - ✅ `get_new_stories(limit)` - New stories - ✅ `get_best_stories(limit)` - Best stories - ✅ `get_item(id)` - Get story/comment by ID - ✅ `get_user(username)` - User profile **Authentication**: None required **Rate Limits**: Generous (no strict limits) **Status**: ✅ Fully working with real data ```rust use ruvector_data_framework::HackerNewsClient; let client = HackerNewsClient::new()?; let stories = client.get_top_stories(10).await?; ``` ### 2. GuardianClient **Base URL**: `https://content.guardianapis.com` **Features**: - ✅ `search(query, limit)` - Search articles - ✅ `get_article(id)` - Get article by ID - ✅ `get_sections()` - List sections - ✅ `search_by_tag(tag, limit)` - Tag-based search **Authentication**: API key required (`GUARDIAN_API_KEY`) **Rate Limits**: Free tier - 12 calls/sec, 5000/day **Mock Fallback**: ✅ Synthetic data when no API key **Get API Key**: https://open-platform.theguardian.com/ ```rust use ruvector_data_framework::GuardianClient; let client = GuardianClient::new(Some("your_api_key".to_string()))?; let articles = client.search("technology", 10).await?; ``` ### 3. NewsDataClient **Base URL**: `https://newsdata.io/api/1` **Features**: - ✅ `get_latest(query, country, category)` - Latest news - ✅ `get_archive(query, from_date, to_date)` - Historical news **Authentication**: API key required (`NEWSDATA_API_KEY`) **Rate Limits**: Free tier - 200 requests/day **Mock Fallback**: ✅ Synthetic data when no API key **Get API Key**: https://newsdata.io/ ```rust use ruvector_data_framework::NewsDataClient; let client = NewsDataClient::new(Some("your_api_key".to_string()))?; let news = client.get_latest(Some("AI"), Some("us"), Some("technology")).await?; ``` ### 4. RedditClient **Base URL**: `https://www.reddit.com` (JSON endpoints) **Features**: - ✅ `get_subreddit_posts(subreddit, sort, limit)` - Subreddit posts - ✅ `get_post_comments(post_id)` - Post comments - ✅ `search(query, subreddit, limit)` - Search posts **Authentication**: None (uses public `.json` endpoints) **Rate Limits**: Be respectful (1 req/sec implemented) **Special Handling**: ✅ Reddit's `.json` suffix pattern ```rust use ruvector_data_framework::RedditClient; let client = RedditClient::new()?; let posts = client.get_subreddit_posts("programming", "hot", 10).await?; ``` ## Architecture ### Data Flow ``` API Response → Deserialize → Convert to DataRecord → Generate Embedding → Return ``` ### Key Components 1. **Response Structures**: Serde deserialization for API JSON responses 2. **Conversion Methods**: `*_to_record()` methods convert API data to `DataRecord` 3. **Embedding Generation**: Uses `SimpleEmbedder` (128-dim bag-of-words) 4. **Retry Logic**: Exponential backoff with 3 max retries 5. **Rate Limiting**: Client-specific delays to respect API limits ### DataRecord Structure ```rust pub struct DataRecord { pub id: String, // Unique ID pub source: String, // "hackernews", "guardian", etc. pub record_type: String, // "story", "article", "post", etc. pub timestamp: DateTime, // Publication time pub data: serde_json::Value, // Raw data pub embedding: Option>, // 128-dim semantic vector pub relationships: Vec, // Graph relationships } ``` ## Testing ### Test Coverage - ✅ 16 comprehensive tests (all passing) - Client creation tests - Conversion function tests - Synthetic data generation tests - Embedding normalization tests - Timestamp parsing tests ### Run Tests ```bash # Run all news_clients tests cargo test news_clients --lib # Run specific test cargo test news_clients::tests::test_hackernews_item_conversion # Run with output cargo test news_clients --lib -- --nocapture ``` ### Test Results ``` test result: ok. 16 passed; 0 failed; 0 ignored ``` ## Demo Example Run the comprehensive demo: ```bash # Basic demo (uses HackerNews without auth) cargo run --example news_social_demo # With API keys GUARDIAN_API_KEY=your_key \ NEWSDATA_API_KEY=your_key \ cargo run --example news_social_demo ``` **Demo Output**: - Fetches top HackerNews stories - Searches Guardian articles - Gets latest NewsData news - Retrieves Reddit posts - Shows embedding information ## Implementation Patterns ### Following api_clients.rs Patterns ✅ **Async/await with tokio** ```rust pub async fn get_top_stories(&self, limit: usize) -> Result> ``` ✅ **Retry logic with exponential backoff** ```rust async fn fetch_with_retry(&self, url: &str) -> Result { let mut retries = 0; loop { match self.client.get(url).send().await { Ok(response) if response.status() == StatusCode::TOO_MANY_REQUESTS => { retries += 1; sleep(Duration::from_millis(RETRY_DELAY_MS * retries as u64)).await; } Ok(response) => return Ok(response), Err(_) if retries < MAX_RETRIES => { retries += 1; sleep(Duration::from_millis(RETRY_DELAY_MS * retries as u64)).await; } Err(e) => return Err(FrameworkError::Network(e)), } } } ``` ✅ **Mock fallback for API key clients** ```rust if self.api_key.is_none() { return Ok(self.generate_synthetic_articles(query, limit)?); } ``` ✅ **Timestamp parsing** ```rust // Unix timestamp (HackerNews, Reddit) let timestamp = DateTime::from_timestamp(unix_time, 0).unwrap_or_else(Utc::now); // RFC3339 (Guardian) let timestamp = DateTime::parse_from_rfc3339(&date_string) .map(|dt| dt.with_timezone(&Utc)) .unwrap_or_else(|_| Utc::now()); // Custom format (NewsData) let timestamp = NaiveDateTime::parse_from_str(d, "%Y-%m-%d %H:%M:%S") .ok() .map(|ndt| ndt.and_utc()) .unwrap_or_else(Utc::now); ``` ✅ **DataSource trait implementation** ```rust #[async_trait] impl DataSource for HackerNewsClient { fn source_id(&self) -> &str { "hackernews" } async fn fetch_batch( &self, _cursor: Option, batch_size: usize, ) -> Result<(Vec, Option)> { let records = self.get_top_stories(batch_size).await?; Ok((records, None)) } async fn total_count(&self) -> Result> { Ok(None) } async fn health_check(&self) -> Result { let response = self.client.get(format!("{}/maxitem.json", self.base_url)).send().await?; Ok(response.status().is_success()) } } ``` ## Special Implementations ### Reddit .json Pattern Reddit's public API uses `.json` suffix: ```rust let url = format!("{}/r/{}/{}.json?limit={}", self.base_url, // https://www.reddit.com subreddit, // "programming" sort, // "hot" limit // 25 ); // Results in: https://www.reddit.com/r/programming/hot.json?limit=25 ``` ### Guardian Tag Relationships Creates graph relationships for tags: ```rust if let Some(tags) = article.tags { for tag in tags { relationships.push(Relationship { target_id: format!("guardian_tag_{}", tag.id), rel_type: "has_tag".to_string(), weight: 1.0, properties: { let mut props = HashMap::new(); props.insert("tag_type".to_string(), serde_json::json!(tag.tag_type)); props.insert("tag_title".to_string(), serde_json::json!(tag.web_title)); props }, }); } } ``` ### HackerNews Relationships Creates author and comment relationships: ```rust // Author relationship if let Some(author) = &item.by { relationships.push(Relationship { target_id: format!("hn_user_{}", author), rel_type: "authored_by".to_string(), weight: 1.0, properties: HashMap::new(), }); } // Comment relationships for &kid_id in &item.kids { relationships.push(Relationship { target_id: format!("hn_item_{}", kid_id), rel_type: "has_comment".to_string(), weight: 1.0, properties: HashMap::new(), }); } ``` ## Error Handling All clients use the framework's `Result` type: ```rust pub type Result = std::result::Result; pub enum FrameworkError { Ingestion(String), Coherence(String), Discovery(String), Network(#[from] reqwest::Error), Serialization(#[from] serde_json::Error), Graph(String), Config(String), } ``` ## Rate Limiting Each client respects API limits: | Client | Rate Limit | Implementation | |--------|-----------|----------------| | HackerNews | Generous | 100ms delay | | Guardian | 12/sec, 5000/day | 100ms delay | | NewsData | 200/day | 500ms delay | | Reddit | Be respectful | 1000ms delay | ## Future Enhancements Potential improvements: - [ ] Twitter/X API integration - [ ] Mastodon API client - [ ] Discord message fetching - [ ] Telegram channel scraping - [ ] Advanced rate limit handling with token buckets - [ ] Caching layer for repeated requests - [ ] Streaming updates for real-time feeds - [ ] Sentiment analysis integration - [ ] Topic modeling on aggregated news ## Contributing When adding new news/social clients: 1. Follow the patterns in `api_clients.rs` 2. Implement `DataSource` trait 3. Add comprehensive tests 4. Generate embeddings for all text content 5. Create relationships where applicable 6. Handle timestamps correctly 7. Implement retry logic 8. Add mock/synthetic data fallback for API key clients ## License Part of RuVector data discovery framework.