wifi-densepose/docs/user-guide/troubleshooting.md
2025-06-07 11:44:19 +00:00

# Troubleshooting Guide
## Overview
This guide covers common issues encountered when running the WiFi-DensePose system: installation problems, hardware connectivity, performance, API and connectivity errors, data quality, and system errors.
## Table of Contents
1. [Quick Diagnostics](#quick-diagnostics)
2. [Installation Issues](#installation-issues)
3. [Hardware Problems](#hardware-problems)
4. [Performance Issues](#performance-issues)
5. [API and Connectivity Issues](#api-and-connectivity-issues)
6. [Data Quality Issues](#data-quality-issues)
7. [System Errors](#system-errors)
8. [Domain-Specific Issues](#domain-specific-issues)
9. [Advanced Troubleshooting](#advanced-troubleshooting)
10. [Getting Support](#getting-support)
## Quick Diagnostics
### System Health Check
Run a comprehensive system health check to identify issues:
```bash
# Check system status
curl http://localhost:8000/api/v1/system/status
# Run built-in diagnostics
curl http://localhost:8000/api/v1/system/diagnostics
# Check component health
curl http://localhost:8000/api/v1/health
```
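When these checks are run often, a short script can call all three endpoints and report which ones respond. The following is a minimal sketch using only the Python standard library. The endpoint paths are the ones listed above; the function names `http_status` and `check_endpoints` and the 5-second timeout are illustrative assumptions, not part of the WiFi-DensePose API.

```python
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"
ENDPOINTS = [
    "/api/v1/system/status",
    "/api/v1/system/diagnostics",
    "/api/v1/health",
]


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # Server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # Connection refused, timeout, DNS failure, ...


def check_endpoints(base_url=BASE_URL, endpoints=ENDPOINTS, fetch=http_status):
    """Map each endpoint path to True if it answered with a 2xx status."""
    results = {}
    for path in endpoints:
        status = fetch(base_url + path)
        results[path] = status is not None and 200 <= status < 300
    return results
```

Calling `check_endpoints()` returns a dictionary keyed by endpoint path; any `False` entry points to the component to investigate first.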
### Log Analysis
Check system logs for error patterns:
```bash
# View recent logs
docker-compose logs --tail=100 wifi-densepose-api
# Search for errors
docker-compose logs | grep -i error
# Check specific component logs
docker-compose logs neural-network
docker-compose logs csi-processor
```
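For long logs, counting lines per severity gives a quicker overview than reading `grep` output. This is a sketch under the assumption that log lines contain plain keywords such as `ERROR` or `WARNING`; the actual WiFi-DensePose log format is not specified here, so the pattern and the function name `count_severities` are illustrative and may need adjusting.

```python
import re
from collections import Counter

# Matches common severity keywords; adjust to the log format actually in use.
SEVERITY_PATTERN = re.compile(r"\b(ERROR|CRITICAL|WARNING|Traceback)\b", re.IGNORECASE)


def count_severities(lines):
    """Count log lines per severity keyword (first keyword found on each line)."""
    counts = Counter()
    for line in lines:
        match = SEVERITY_PATTERN.search(line)
        if match:
            counts[match.group(1).upper()] += 1
    return counts
```

One way to use it is to save the output of `docker-compose logs --no-color` to a file and pass the open file object to `count_severities`.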
### Resource Monitoring
Monitor system resources:
```bash
# Check Docker container resources
docker stats
# Check system resources
htop
nvidia-smi # For GPU monitoring
# Check disk space
df -h
```
## Installation Issues
### Docker Installation Problems
#### Issue: Docker Compose Fails to Start
**Symptoms:**
- Services fail to start
- Port conflicts
- Permission errors
**Solutions:**
1. **Check Port Availability:**
```bash
# Check if port 8000 is in use
netstat -tulpn | grep :8000
lsof -i :8000
# Kill process using the port
sudo kill -9 <PID>
```
2. **Fix Permission Issues:**
```bash
# Add user to docker group
sudo usermod -aG docker $USER
newgrp docker
# Fix file permissions
sudo chown -R $USER:$USER .
```
3. **Update Docker Compose:**
```bash
# Update Docker Compose
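# Release assets use a lowercase OS name (for example docker-compose-linux-x86_64)
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m)" -o /usr/local/bin/docker-compose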
sudo chmod +x /usr/local/bin/docker-compose
```
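The port check from the first solution can also be scripted, for example as a pre-flight step before `docker-compose up`. A minimal sketch using the standard `socket` module; the function name `is_port_in_use` and the default of `127.0.0.1` are assumptions for illustration.

```python
import socket


def is_port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0
```

`is_port_in_use(8000)` returning `True` before the stack is started suggests another process already holds the API port.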
#### Issue: Out of Disk Space
**Symptoms:**
- Build failures
- Container crashes
- Database errors
**Solutions:**
1. **Clean Docker Resources:**
```bash
# Remove unused containers, networks, images
docker system prune -a
# Remove unused volumes
docker volume prune
# Check disk usage
docker system df
```
2. **Configure Storage Location:**
```yaml
# In docker-compose.yml, under the affected service, mount external storage
volumes:
  - /external/storage/data:/app/data
  - /external/storage/models:/app/models
```
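Disk space can be checked before a build rather than after it fails. A small sketch with `shutil.disk_usage` from the standard library; the 10 GiB default is an arbitrary example value, not a documented requirement of WiFi-DensePose.

```python
import shutil


def free_space_gib(path="."):
    """Return free disk space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 1024**3


def has_enough_space(path=".", required_gib=10.0):
    """Return True if at least required_gib of free space is available at path."""
    return free_space_gib(path) >= required_gib
```

Point `path` at the directory that holds the data and model volumes, since that filesystem may differ from the one holding the source tree.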
### Native Installation Problems
#### Issue: Python Dependencies Fail to Install
**Symptoms:**
- pip install errors
- Compilation failures
- Missing system libraries
**Solutions:**
1. **Install System Dependencies:**
```bash
# Ubuntu/Debian
sudo apt update
sudo apt install -y build-essential cmake python3-dev
sudo apt install -y libopencv-dev libffi-dev libssl-dev
# CentOS/RHEL
sudo yum groupinstall -y "Development Tools"
sudo yum install -y python3-devel opencv-devel
```
2. **Use Virtual Environment:**
```bash
# Create clean virtual environment
python3 -m venv venv_clean
source venv_clean/bin/activate
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
```
3. **Install PyTorch Separately:**
```bash
# Install PyTorch with specific CUDA version
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Or CPU-only version
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```
#### Issue: CUDA/GPU Setup Problems
**Symptoms:**
- GPU not detected
- CUDA version mismatch
- Out of GPU memory
**Solutions:**
1. **Verify CUDA Installation:**
```bash
# Check CUDA version
nvcc --version
nvidia-smi
# Check PyTorch CUDA support
python -c "import torch; print(torch.cuda.is_available())"
```
2. **Install Correct CUDA Version:**
```bash
# Install CUDA 11.8 (example)
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run
```
3. **Configure GPU Memory:**
```bash
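# Restrict the process to the first GPU
export CUDA_VISIBLE_DEVICES=0
# Cap the allocator's split block size to reduce fragmentation (this is not a hard memory limit)
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512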
```
## Hardware Problems
### Router Connectivity Issues
#### Issue: Cannot Connect to Router
**Symptoms:**
- No CSI data received
- Connection timeouts
- Authentication failures
**Solutions:**
1. **Verify Network Connectivity:**
```bash
# Ping router
ping 192.168.1.1
# Check SSH access
ssh root@192.168.1.1
# Test CSI port
telnet 192.168.1.1 5500
```
2. **Check Router Configuration:**
```bash
# SSH into router and check CSI tools
ssh root@192.168.1.1
csi_tool --status
# Restart CSI service
/etc/init.d/csi restart
```
3. **Verify Firewall Settings:**
```bash
# Check iptables rules
iptables -L
# Allow CSI port
iptables -A INPUT -p tcp --dport 5500 -j ACCEPT
```
#### Issue: Poor CSI Data Quality
**Symptoms:**
- High packet loss
- Inconsistent data rates
- Signal interference
**Solutions:**
1. **Optimize Router Placement:**
```bash
# Check signal strength
iwconfig wlan0
# Analyze interference
iwlist wlan0 scan | grep -E "(ESSID|Frequency|Quality)"
```
2. **Adjust CSI Parameters:**
```bash
# Reduce sampling rate (csi_rate is specific to the CSI firmware; radio0 is an assumed radio name)
uci set wireless.radio0.csi_rate='20'
# Change channel
uci set wireless.radio0.channel='6'
uci commit wireless
wifi reload
```
3. **Monitor Data Quality:**
```bash
# Check CSI data statistics
curl http://localhost:8000/api/v1/hardware/csi/stats
# View real-time quality metrics
curl http://localhost:8000/api/v1/hardware/status
```
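Packet loss can be quantified when captured CSI frames carry a sequence number. The sketch below assumes such a field exists, increases by 1 per transmitted frame, and does not wrap; none of this is confirmed by the guide, and the function name `packet_loss_rate` is hypothetical.

```python
def packet_loss_rate(sequence_numbers):
    """Estimate the fraction of frames lost from received sequence numbers.

    Assumes sequence numbers increase by 1 per transmitted frame and do not wrap.
    """
    unique = sorted(set(sequence_numbers))
    if len(unique) < 2:
        return 0.0
    # Frames expected between the first and last sequence number seen
    expected = unique[-1] - unique[0] + 1
    return 1.0 - len(unique) / expected
```

A rate that stays high after changing channel or router placement points at the link itself rather than at interference.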
### Hardware Resource Issues
#### Issue: High CPU Usage
**Symptoms:**
- System slowdown
- Processing delays
- High temperature
**Solutions:**
1. **Optimize Processing Settings:**
```bash
# Reduce batch size
export POSE_PROCESSING_BATCH_SIZE=16
# Lower frame rate
export STREAM_FPS=15
# Disable unnecessary features
export ENABLE_HISTORICAL_DATA=false
```
2. **Scale Resources:**
```bash
# Increase worker processes
export WORKERS=4
# Use process affinity
taskset -c 0-3 python -m src.api.main
```
#### Issue: GPU Memory Errors
**Symptoms:**
- CUDA out of memory errors
- Model loading failures
- Inference crashes
**Solutions:**
1. **Optimize GPU Usage:**
```bash
# Reduce batch size
export POSE_PROCESSING_BATCH_SIZE=8
# Enable mixed precision
export ENABLE_MIXED_PRECISION=true
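# Clear GPU cache (this frees only the calling process's cache; to release memory
# held by the running service, call torch.cuda.empty_cache() inside it or restart it)
python -c "import torch; torch.cuda.empty_cache()"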
```
2. **Monitor GPU Memory:**
```bash
# Watch GPU memory usage
watch -n 1 nvidia-smi
# Check memory allocation (reports the current Python process only)
python -c "
import torch
print(f'Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB')
print(f'Reserved: {torch.cuda.memory_reserved()/1024**3:.2f} GB')
"
```
## Performance Issues
### Slow Pose Detection
#### Issue: Low Processing Frame Rate
**Symptoms:**
- FPS below expected rate
- High latency
- Delayed responses
**Solutions:**
1. **Optimize Neural Network:**
```bash
# Use TensorRT optimization
export ENABLE_TENSORRT=true
# Enable model quantization
export MODEL_QUANTIZATION=int8
# Use smaller model variant
export POSE_MODEL_PATH="./models/densepose_mobile.pth"
```
2. **Tune Processing Pipeline:**
```bash
# Increase batch size (if GPU memory allows)
export POSE_PROCESSING_BATCH_SIZE=64
# Reduce input resolution
export INPUT_RESOLUTION=256
# Skip frames for real-time processing
export FRAME_SKIP_RATIO=2
```
3. **Parallel Processing:**
```bash
# Enable multi-threading
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
# Use multiple GPU devices
export CUDA_VISIBLE_DEVICES=0,1
```
### Memory Issues
#### Issue: High Memory Usage
**Symptoms:**
- System running out of RAM
- Swap usage increasing
- OOM killer activated
**Solutions:**
1. **Optimize Memory Usage:**
```bash
# Reduce buffer sizes
export CSI_BUFFER_SIZE=500
export STREAM_BUFFER_SIZE=50
# Limit historical data retention
export DATA_RETENTION_HOURS=24
# Enable memory mapping for large files
export USE_MEMORY_MAPPING=true
```
2. **Configure Swap:**
```bash
# Add swap space
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```
## API and Connectivity Issues
### Authentication Problems
#### Issue: JWT Token Errors
**Symptoms:**
- 401 Unauthorized responses
- Token expired errors
- Invalid signature errors
**Solutions:**
1. **Verify Token Configuration:**
```bash
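# Check that the secret key is set (avoid printing its value)
[ -n "$SECRET_KEY" ] && echo "SECRET_KEY is set" || echo "SECRET_KEY is empty"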
# Verify token expiration
curl -X POST http://localhost:8000/api/v1/auth/verify \
-H "Authorization: Bearer <token>"
```
2. **Regenerate Tokens:**
```bash
# Get new token
curl -X POST http://localhost:8000/api/v1/auth/token \
-H "Content-Type: application/json" \
-d '{"username": "admin", "password": "password"}'
```
3. **Check System Time:**
```bash
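# Ensure system time is correct (token expiry checks depend on it)
timedatectl status
# Enable NTP synchronization on systemd systems
sudo timedatectl set-ntp true
# Or, where ntpdate is still installed
sudo ntpdate -s time.nist.gov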
```
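To tell an expired token from a clock problem, it helps to read the token's `exp` claim directly. This sketch decodes the JWT payload with the standard library only and does not verify the signature, so it is suitable for local debugging only. It assumes the tokens are standard three-part JWTs; the function names are illustrative.

```python
import base64
import json
import time


def jwt_expiry(token):
    """Return the `exp` claim (seconds since the epoch) from a JWT, or None if absent.

    The signature is NOT verified; use this only for local debugging.
    """
    payload_b64 = token.split(".")[1]
    # JWTs use unpadded base64url; restore the padding before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload.get("exp")


def seconds_until_expiry(token, now=None):
    """Return seconds until the token expires (negative if already expired)."""
    exp = jwt_expiry(token)
    if exp is None:
        return None
    return exp - (time.time() if now is None else now)
```

A negative result for a freshly issued token suggests that the client and server clocks disagree.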
### WebSocket Connection Issues
#### Issue: WebSocket Disconnections
**Symptoms:**
- Frequent disconnections
- Connection timeouts
- No real-time data
**Solutions:**
1. **Adjust WebSocket Settings:**
```bash
# Increase timeout values
export WEBSOCKET_TIMEOUT=600
export WEBSOCKET_PING_INTERVAL=30
# Enable keep-alive
export WEBSOCKET_KEEPALIVE=true
```
2. **Check Network Configuration:**
```bash
# Test WebSocket connection
wscat -c ws://localhost:8000/ws/pose
# Check proxy settings
curl -I http://localhost:8000/ws/pose
```
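Server-side timeouts only help if clients also reconnect sensibly. A common client-side pattern is exponential backoff between reconnect attempts, sketched below. The defaults (1 s base, 30 s cap, 6 attempts) and the name `backoff_delays` are assumptions for illustration, not WiFi-DensePose settings.

```python
import random


def backoff_delays(max_attempts=6, base_seconds=1.0, cap_seconds=30.0, jitter=0.0, rng=random):
    """Yield wait times for successive reconnect attempts.

    The delay doubles each attempt up to cap_seconds; jitter (0..1) randomly
    shortens each delay by up to that fraction so clients do not reconnect in lockstep.
    """
    for attempt in range(max_attempts):
        delay = min(cap_seconds, base_seconds * 2**attempt)
        yield delay * (1.0 - jitter * rng.random())
```

A client would sleep for each yielded delay before retrying the connection to `ws://localhost:8000/ws/pose`, and stop iterating once the connection succeeds.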
### Rate Limiting Issues
#### Issue: Rate Limit Exceeded
**Symptoms:**
- 429 Too Many Requests errors
- API calls being rejected
- Slow response times
**Solutions:**
1. **Adjust Rate Limits:**
```bash
# Increase rate limits
export RATE_LIMIT_REQUESTS=1000
export RATE_LIMIT_WINDOW=3600
# Disable rate limiting for development
export ENABLE_RATE_LIMITING=false
```
2. **Implement Request Batching:**
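```python
import time

# Send requests in batches, pausing between batches to stay under the rate limit
def batch_requests(requests, send_batch, batch_size=10, delay_seconds=1.0):
    for i in range(0, len(requests), batch_size):
        batch = requests[i:i + batch_size]
        send_batch(batch)  # Process batch (send_batch is supplied by the caller)
        time.sleep(delay_seconds)  # Rate limiting delay
```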
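Clients can also pace individual requests so they never exceed the configured limit. The sketch below spaces calls evenly, assuming the server allows `RATE_LIMIT_REQUESTS` requests per `RATE_LIMIT_WINDOW` seconds; the `Throttle` class is illustrative and not part of the WiFi-DensePose client.

```python
import time


class Throttle:
    """Space out calls so that at most max_requests are made per window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic, sleep=time.sleep):
        self.interval = window_seconds / max_requests
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = clock()

    def wait(self):
        """Block until the next request is allowed, then reserve the following slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval
```

With the values shown above, `Throttle(1000, 3600)` allows one request every 3.6 seconds; call `wait()` before each API request. Even spacing is stricter than a fixed-window limit, so it trades some burst capacity for never receiving a 429.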
## Data Quality Issues
### Poor Detection Accuracy
#### Issue: Low Confidence Scores
**Symptoms:**
- Many false positives
- Missing detections
- Inconsistent tracking
**Solutions:**
1. **Adjust Detection Thresholds:**
```bash
# Increase confidence threshold
curl -X PUT http://localhost:8000/api/v1/config \
-H "Content-Type: application/json" \
-d '{"detection": {"confidence_threshold": 0.8}}'
```
2. **Improve Environment Setup:**
```bash
# Recalibrate system
curl -X POST http://localhost:8000/api/v1/system/calibrate
# Check for interference
curl http://localhost:8000/api/v1/hardware/interference
```
3. **Optimize Model Parameters:**
```bash
# Use domain-specific model
export POSE_MODEL_PATH="./models/healthcare_optimized.pth"
# Enable post-processing filters
export ENABLE_TEMPORAL_SMOOTHING=true
export ENABLE_OUTLIER_FILTERING=true
```
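The guide does not describe how `ENABLE_TEMPORAL_SMOOTHING` works internally. As a rough illustration of what temporal smoothing does to noisy per-frame values such as confidence scores, the sketch below applies an exponential moving average, which is one common approach and not necessarily the one the system uses. The function name and the default `alpha` are assumptions.

```python
def exponential_smoothing(values, alpha=0.3):
    """Smooth a sequence with an exponential moving average.

    alpha in (0, 1]: higher values follow the input more closely.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for value in values:
        if smoothed:
            # Blend the new value with the previous smoothed value
            value = alpha * value + (1.0 - alpha) * smoothed[-1]
        smoothed.append(value)
    return smoothed
```

Smoothing reduces frame-to-frame flicker around the confidence threshold at the cost of reacting more slowly to real changes.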
### Tracking Issues
#### Issue: Person ID Switching
**Symptoms:**
- IDs change frequently
- Lost tracks
- Duplicate persons
**Solutions:**
1. **Tune Tracking Parameters:**
```bash
# Adjust tracking thresholds
curl -X PUT http://localhost:8000/api/v1/config \
  -H "Content-Type: application/json" \
  -d '{
    "tracking": {
      "max_age": 30,
      "min_hits": 3,
      "iou_threshold": 0.3
    }
  }'
```
2. **Improve Detection Consistency:**
```bash
# Enable temporal smoothing
export ENABLE_TEMPORAL_SMOOTHING=true
# Use appearance features
export USE_APPEARANCE_FEATURES=true
```
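The tracking parameter names above resemble those of SORT-style trackers, where `iou_threshold` is the minimum bounding-box overlap for a detection to be matched to an existing track. Assuming that interpretation (the guide does not confirm it), the sketch below shows how intersection over union is computed; the `(x_min, y_min, x_max, y_max)` box layout is also an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0
```

Under this reading, a lower `iou_threshold` keeps IDs attached through larger frame-to-frame movement but makes it easier for nearby people to swap IDs.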
## System Errors
### Database Issues
#### Issue: Database Connection Errors
**Symptoms:**
- Connection refused errors
- Timeout errors
- Data not persisting
**Solutions:**
1. **Check Database Status:**
```bash
# PostgreSQL
sudo systemctl status postgresql
sudo -u postgres psql -c "SELECT version();"
# SQLite
ls -la ./data/wifi_densepose.db
sqlite3 ./data/wifi_densepose.db ".tables"
```
2. **Fix Connection Issues:**
```bash
# Reset database connection
export DATABASE_URL="postgresql://user:password@localhost:5432/wifi_densepose"
# Restart database service
sudo systemctl restart postgresql
```
3. **Database Migration:**
```bash
# Run database migrations
python -m src.database.migrate
# Reset database (WARNING: Data loss)
python -m src.database.reset --confirm
```
### Service Crashes
#### Issue: API Service Crashes
**Symptoms:**
- Service stops unexpectedly
- No response from API
- Error 502/503 responses
**Solutions:**
1. **Check Service Logs:**
```bash
# View crash logs
journalctl -u wifi-densepose -f
# Check for segmentation faults
dmesg | grep -i "segfault"
```
2. **Restart Services:**
```bash
# Restart with Docker
docker-compose restart wifi-densepose-api
# Restart native service
sudo systemctl restart wifi-densepose
```
3. **Debug Memory Issues:**
```bash
# Run with memory debugging
valgrind --tool=memcheck python -m src.api.main
# Check for memory leaks (start the service with tracemalloc enabled, keeping 25 frames per allocation)
python -X tracemalloc=25 -m src.api.main
```
## Domain-Specific Issues
### Healthcare Domain Issues
#### Issue: Fall Detection False Alarms
**Symptoms:**
- Too many fall alerts
- Normal activities triggering alerts
- Delayed detection
**Solutions:**
1. **Adjust Sensitivity:**
```bash
curl -X PUT http://localhost:8000/api/v1/config \
  -H "Content-Type: application/json" \
  -d '{
    "alerts": {
      "fall_detection": {
        "sensitivity": 0.7,
        "notification_delay_seconds": 10
      }
    }
  }'
```
2. **Improve Training Data:**
```bash
# Collect domain-specific training data
python -m src.training.collect_healthcare_data
# Retrain model with healthcare data
python -m src.training.train_healthcare_model
```
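A notification delay typically reduces false alarms by requiring the fall state to persist before an alert is sent. The sketch below illustrates that idea; it assumes `notification_delay_seconds` behaves this way, which the guide does not state, and the function name and sample format are hypothetical.

```python
def first_alert_time(samples, delay_seconds=10.0):
    """Return the time at which a fall alert would fire, or None.

    samples: iterable of (timestamp_seconds, fall_detected) pairs in time order.
    An alert fires once fall_detected has stayed True for delay_seconds.
    """
    fall_started = None
    for timestamp, fall_detected in samples:
        if not fall_detected:
            fall_started = None  # Person recovered; reset the timer
        elif fall_started is None:
            fall_started = timestamp
        elif timestamp - fall_started >= delay_seconds:
            return timestamp
    return None
```

Under this model, someone who sits down abruptly and stands up again within the delay never triggers an alert, while a longer delay also postpones alerts for real falls.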
### Retail Domain Issues
#### Issue: Inaccurate Traffic Counting
**Symptoms:**
- Wrong visitor counts
- Missing entries/exits
- Double counting
**Solutions:**
1. **Calibrate Zone Detection:**
```bash
# Define entrance/exit zones
curl -X PUT http://localhost:8000/api/v1/config \
  -H "Content-Type: application/json" \
  -d '{
    "zones": {
      "entrance": {
        "coordinates": [[0, 0], [100, 50]],
        "type": "entrance"
      }
    }
  }'
```
2. **Optimize Tracking:**
```bash
# Enable zone-based tracking
export ENABLE_ZONE_TRACKING=true
# Adjust dwell time thresholds
export MIN_DWELL_TIME_SECONDS=5
```
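Double counting often comes from a track that crosses a zone boundary more than once. The sketch below counts zone entries for a single track, assuming the `coordinates` value above describes two opposite corners of an axis-aligned rectangle; that interpretation and both function names are assumptions.

```python
def in_zone(point, corners):
    """Return True if point (x, y) lies inside the axis-aligned rectangle
    spanned by two opposite corners, e.g. [[0, 0], [100, 50]]."""
    (x1, y1), (x2, y2) = corners
    x, y = point
    return min(x1, x2) <= x <= max(x1, x2) and min(y1, y2) <= y <= max(y1, y2)


def count_entries(track, corners):
    """Count how many times a track (list of (x, y) positions) enters the zone."""
    entries = 0
    was_inside = False
    for point in track:
        inside = in_zone(point, corners)
        if inside and not was_inside:
            entries += 1
        was_inside = inside
    return entries
```

If one visitor produces several entries this way, a minimum dwell time such as `MIN_DWELL_TIME_SECONDS` is the kind of filter that suppresses the extra counts.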
## Advanced Troubleshooting
### Performance Profiling
#### CPU Profiling
```bash
# Profile Python code
python -m cProfile -o profile.stats -m src.api.main
# Analyze profile
python -c "
import pstats
p = pstats.Stats('profile.stats')
p.sort_stats('cumulative').print_stats(20)
"
```
#### GPU Profiling
```bash
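# Profile CUDA kernels (nvprof is deprecated and not supported on newer GPUs)
nvprof python -m src.neural_network.inference
# On newer GPUs, use Nsight Systems instead
nsys profile python -m src.neural_network.inference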
# Use PyTorch profiler
python -c "
import torch
import torch.profiler
with torch.profiler.profile() as prof:
    # Your code here
    pass
print(prof.key_averages().table())
"
```
### Network Debugging
#### Packet Capture
```bash
# Capture CSI packets
sudo tcpdump -i eth0 port 5500 -w csi_capture.pcap
# Analyze with Wireshark
wireshark csi_capture.pcap
```
#### Network Latency Testing
```bash
# Test network latency
ping -c 100 192.168.1.1 | tail -1
# Test bandwidth
iperf3 -c 192.168.1.1 -t 60
```
### System Monitoring
#### Real-time Monitoring
```bash
# Monitor system resources
htop
iotop
nethogs
# Monitor GPU
nvidia-smi -l 1
# Monitor Docker containers
docker stats --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}"
```
#### Log Aggregation
```bash
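# Centralized logging with ELK stack (single-node setup for testing)
docker network create elastic
docker run -d --name elasticsearch --net elastic -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.17.0
docker run -d --name kibana --net elastic -p 5601:5601 -e "ELASTICSEARCH_HOSTS=http://elasticsearch:9200" kibana:7.17.0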
# Configure log shipping
echo 'LOGGING_DRIVER=syslog' >> .env
echo 'SYSLOG_ADDRESS=tcp://localhost:514' >> .env
```
## Getting Support
### Collecting Diagnostic Information
Before contacting support, collect the following information:
```bash
# System information
uname -a
cat /etc/os-release
docker --version
python --version
# Application logs
docker-compose logs --tail=1000 > logs.txt
# Configuration (review config.txt and redact secrets such as SECRET_KEY and DATABASE_URL before sharing)
cat .env > config.txt
curl http://localhost:8000/api/v1/system/status > status.json
# Hardware information
lscpu
free -h
nvidia-smi > gpu_info.txt
```
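Since `.env` usually contains credentials, it is worth masking them before attaching the file to a support request. A minimal sketch follows; the list of sensitive key names is a guess based on the variables mentioned in this guide and should be extended for each deployment.

```python
import re

# Key names that usually hold credentials; extend this list for your deployment.
SENSITIVE_KEY = re.compile(r"SECRET|PASSWORD|TOKEN|KEY|DATABASE_URL", re.IGNORECASE)


def redact_env(text):
    """Return .env-style text with the values of sensitive keys replaced."""
    redacted = []
    for line in text.splitlines():
        key, sep, _value = line.partition("=")
        # Only KEY=value lines are candidates; everything else is kept as-is
        if sep and SENSITIVE_KEY.search(key):
            line = f"{key}=<redacted>"
        redacted.append(line)
    return "\n".join(redacted)
```

Read `.env`, pass its contents through `redact_env`, and write the result to `config.txt`; still review the output by hand before sending it.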
### Support Channels
1. **Documentation**: Check the comprehensive documentation first
2. **GitHub Issues**: Report bugs and feature requests
3. **Community Forum**: Ask questions and share solutions
4. **Enterprise Support**: For commercial deployments
### Creating Effective Bug Reports
Include the following information:
1. **Environment Details**:
- Operating system and version
- Hardware specifications
- Docker/Python versions
2. **Steps to Reproduce**:
- Exact commands or API calls
- Configuration settings
- Input data characteristics
3. **Expected vs Actual Behavior**:
- What you expected to happen
- What actually happened
- Error messages and logs
4. **Additional Context**:
- Screenshots or videos
- Configuration files
- System logs
### Emergency Procedures
For critical production issues:
1. **Immediate Actions**:
```bash
# Stop the system safely
curl -X POST http://localhost:8000/api/v1/system/stop
# Backup current data
cp -r ./data ./data_backup_$(date +%Y%m%d_%H%M%S)
# Restart with minimal configuration
export MOCK_HARDWARE=true
docker-compose up -d
```
2. **Rollback Procedures**:
```bash
# Rollback to previous version
git checkout <previous-tag>
docker-compose down
docker-compose up -d
# Restore data backup
rm -rf ./data
cp -r ./data_backup_<timestamp> ./data
```
3. **Contact Information**:
- Emergency support: support@wifi-densepose.com
- Phone: +1-555-SUPPORT
- Slack: #wifi-densepose-emergency
---
**Remember**: Most issues can be resolved by checking logs, verifying configuration, and ensuring proper hardware setup. When in doubt, start with the basic diagnostics and work your way through the troubleshooting steps systematically.
For additional help, see:
- [Configuration Guide](configuration.md)
- [API Reference](api-reference.md)
- [Hardware Setup Guide](../hardware/router-setup.md)
- [Deployment Guide](../developer/deployment-guide.md)