Add CLIP fine-tuning pipeline for logo recognition

Implement contrastive learning with LoRA to fine-tune CLIP's vision encoder on LogoDet-3K dataset for improved logo embedding similarity.

New training module (training/):
- config.py: TrainingConfig dataclass with all hyperparameters
- dataset.py: LogoContrastiveDataset with logo-level splits
- model.py: LogoFineTunedCLIP wrapper with LoRA support
- losses.py: InfoNCE, TripletLoss, SupConLoss implementations
- trainer.py: Training loop with mixed precision and checkpointing
- evaluation.py: EmbeddingEvaluator for validation metrics

New scripts:
- train_clip_logo.py: Main training entry point
- export_model.py: Export to HuggingFace-compatible format

Configurations:
- configs/jetson_orin.yaml: Optimized for Jetson Orin AGX
- configs/cloud_rtx4090.yaml: Optimized for 24GB cloud GPUs
- configs/cloud_a100.yaml: Optimized for 80GB cloud GPUs

Documentation:
- CLIP_FINETUNING.md: Training guide and usage instructions
- CLOUD_TRAINING.md: Cloud GPU recommendations and cost estimates

Modified:
- logo_detection_detr.py: Add fine-tuned model loading support
- pyproject.toml: Add peft, pyyaml, torchvision dependencies
2026-01-04 13:45:25 -05:00
parent 1551360028
commit 44e8b6ae7d
16 changed files with 3334 additions and 12 deletions
--- a/CLOUD_TRAINING.md
+++ b/CLOUD_TRAINING.md
@ -0,0 +1,269 @@
+# Cloud GPU Training for CLIP Fine-Tuning
+
+This document explains how to use cloud GPU instances (e.g., RunPod) to fine-tune CLIP faster than local training on Jetson Orin AGX allows.
+
+## Training Time Comparison
+
+Local training on Jetson Orin AGX takes approximately 24 hours. The cloud GPUs below are estimated to finish the same run in 1.5-7 hours:
+
+| GPU | VRAM | Est. Training Time | Hourly Rate | Est. Total Cost |
+|-----|------|-------------------|-------------|-----------------|
+| **RTX 4090** | 24GB | 4-6 hours | $0.59/hr | **$2.40-$3.50** |
+| **RTX 3090** | 24GB | 5-7 hours | $0.39/hr | **$2.00-$2.75** |
+| **A100 80GB** | 80GB | 2-3 hours | $1.99/hr | **$4.00-$6.00** |
+| **L40S** | 48GB | 3-4 hours | $0.89/hr | **$2.70-$3.60** |
+| **H100 80GB** | 80GB | 1.5-2 hours | $1.99/hr | **$3.00-$4.00** |
+
+*Prices from RunPod Community Cloud as of January 2025. Rates may vary.*
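+
+The cost column is training time multiplied by the hourly rate. As a worked check on the RTX 4090 row (the symbols $t_{\text{train}}$ and $r_{\text{hourly}}$ are introduced here only for illustration, and the table rounds the bounds):
+
+$$
+\text{cost} = t_{\text{train}} \times r_{\text{hourly}}, \qquad
+4\,\text{h} \times \$0.59/\text{h} = \$2.36, \qquad
+6\,\text{h} \times \$0.59/\text{h} = \$3.54
+$$
+
+These figures cover GPU time only; storage and any time the pod sits idle before it is stopped would add to the total.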
+
+## Recommendations
+
+### Best Value: RTX 4090 ($0.59/hr)
+- 24GB VRAM is sufficient for ViT-L/14 with LoRA
+- Good balance of speed and cost
+- Widely available on Community Cloud
+- **Total cost: ~$3 for complete training**
+
+### Best Speed: H100 80GB ($1.99/hr)
+- Fastest training (1.5-2 hours)
+- 80GB VRAM allows larger batch sizes
+- Can increase `batch_size` to 64 and reduce `gradient_accumulation_steps` to 2 (see `configs/cloud_a100.yaml`)
+- **Total cost: ~$3-4**
+
+### Budget Option: RTX 3090 ($0.39/hr)
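+- Cheapest hourly rate of the GPUs listed
+- 24GB VRAM is sufficient for ViT-L/14 with LoRA, as with the RTX 4090
+- Slightly slower than the RTX 4090 (est. 5-7 hours vs. 4-6 hours)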
+- **Total cost: ~$2-3**
+
+## Cloud-Optimized Configurations
+
+### RTX 4090 / RTX 3090 (24GB VRAM)
+
+Create `configs/cloud_rtx4090.yaml`:
+
+```yaml
+# Optimized for 24GB VRAM cloud GPUs
+base_model: "openai/clip-vit-large-patch14"
+
+# Dataset paths
+dataset_dir: "LogoDet-3K"
+reference_dir: "reference_logos"
+db_path: "test_data_mapping.db"
+
+# Data splits
+train_split: 0.7
+val_split: 0.15
+test_split: 0.15
+
+# Larger batches for faster training
+batch_size: 32
+logos_per_batch: 32
+samples_per_logo: 4
+gradient_accumulation_steps: 4  # Effective batch = 128
+num_workers: 8
+
+# Model architecture
+lora_r: 16
+lora_alpha: 32
+lora_dropout: 0.1
+freeze_layers: 12
+use_gradient_checkpointing: true
+
+# Training
+learning_rate: 1.0e-5
+weight_decay: 0.01
+warmup_steps: 500
+max_epochs: 20
+mixed_precision: true
+
+# Loss
+temperature: 0.07
+loss_type: "infonce"
+
+# Early stopping
+patience: 5
+min_delta: 0.001
+
+# Output
+checkpoint_dir: "checkpoints"
+output_dir: "models/logo_detection/clip_finetuned"
+save_every_n_epochs: 5
+
+# Logging
+log_every_n_steps: 10
+eval_every_n_epochs: 1
+
+seed: 42
+use_augmentation: true
+augmentation_strength: "medium"
+```
+
+### A100 / H100 (80GB VRAM)
+
+Create `configs/cloud_a100.yaml`:
+
+```yaml
+# Optimized for 80GB VRAM cloud GPUs (A100, H100)
+base_model: "openai/clip-vit-large-patch14"
+
+# Dataset paths
+dataset_dir: "LogoDet-3K"
+reference_dir: "reference_logos"
+db_path: "test_data_mapping.db"
+
+# Data splits
+train_split: 0.7
+val_split: 0.15
+test_split: 0.15
+
+# Maximum batch sizes for 80GB VRAM
+batch_size: 64
+logos_per_batch: 32
+samples_per_logo: 4
+gradient_accumulation_steps: 2  # Effective batch = 128
+num_workers: 8
+
+# Model architecture (can disable gradient checkpointing with 80GB)
+lora_r: 16
+lora_alpha: 32
+lora_dropout: 0.1
+freeze_layers: 12
+use_gradient_checkpointing: false  # Not needed with 80GB
+
+# Training
+learning_rate: 1.0e-5
+weight_decay: 0.01
+warmup_steps: 500
+max_epochs: 20
+mixed_precision: true
+
+# Loss
+temperature: 0.07
+loss_type: "infonce"
+
+# Early stopping
+patience: 5
+min_delta: 0.001
+
+# Output
+checkpoint_dir: "checkpoints"
+output_dir: "models/logo_detection/clip_finetuned"
+save_every_n_epochs: 5
+
+# Logging
+log_every_n_steps: 10
+eval_every_n_epochs: 1
+
+seed: 42
+use_augmentation: true
+augmentation_strength: "medium"
+```
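+
+Both configurations target the same effective batch size, as their inline comments note. Assuming the effective batch is simply `batch_size` (written $B$ here) times `gradient_accumulation_steps` (written $k$), which is what those comments imply, the arithmetic is:
+
+$$
+\begin{aligned}
+B_{\text{eff}} &= B \times k \\
+\text{24GB config:}\quad 32 \times 4 &= 128 \\
+\text{80GB config:}\quad 64 \times 2 &= 128
+\end{aligned}
+$$
+
+Under that assumption the 80GB config reaches each optimizer update in half as many forward/backward passes, so the two setups should behave similarly during optimization while the larger GPU gets through an epoch faster.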
+
+## RunPod Quick Start
+
+### 1. Create a Pod
+
+1. Go to [RunPod](https://www.runpod.io/)
+2. Select GPU (RTX 4090 recommended)
+3. Choose PyTorch template (CUDA 12.x)
+4. Set volume size: 50GB (for dataset + models)
+
+### 2. Setup Environment
+
+```bash
+# Connect via SSH or web terminal
+
+# Install dependencies
+pip install peft pyyaml torchvision transformers tqdm pillow
+
+# Clone your repository (or upload files)
+git clone <your-repo-url>
+cd logo_test
+
+# Or use runpodctl to sync files
+# runpodctl send logo_test/
+```
+
+### 3. Prepare Data
+
+If data isn't already prepared:
+
+```bash
+# This creates reference_logos/ and test_data_mapping.db
+python prepare_test_data.py
+```
+
+### 4. Run Training
+
+```bash
+# For RTX 4090
+python train_clip_logo.py --config configs/cloud_rtx4090.yaml
+
+# For A100/H100
+python train_clip_logo.py --config configs/cloud_a100.yaml
+
+# Or with command-line overrides
+python train_clip_logo.py --config configs/jetson_orin.yaml \
+    --batch-size 32 \
+    --gradient-accumulation-steps 4 \
+    --num-workers 8
+```
+
+### 5. Download Results
+
+```bash
+# Export the trained model
+python export_model.py \
+    --checkpoint checkpoints/best.pt \
+    --output models/logo_detection/clip_finetuned
+
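+# Download to local machine
+# Option 1: Use runpodctl. Run `send` on the pod; it prints a one-time code.
+# Then run `runpodctl receive <code>` on the local machine.
+runpodctl send models/logo_detection/clip_finetuned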
+
+# Option 2: SCP (run on the local machine; add `-P <port>` if the pod's SSH port is not 22)
+scp -r root@<pod-ip>:/workspace/logo_test/models/logo_detection/clip_finetuned ./
+
+# Option 3: Compress and download via web
+tar -czvf clip_finetuned.tar.gz models/logo_detection/clip_finetuned
+```
+
+## Cost Optimization Tips
+
+### Use Spot/Interruptible Instances
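+- Community Cloud GPUs are already cheaper than RunPod's Secure Cloud tier
+- Some providers offer spot pricing for additional savings
+- Save checkpoints frequently (e.g., `save_every_n_epochs: 2` instead of the 5 used in the configs above), since interruptible instances can be stopped mid-run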
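+
+As a sketch of the checkpoint tip, the only change needed in either cloud config would be the `save_every_n_epochs` key; the other keys stay as shown earlier:
+
+```yaml
+# Output
+checkpoint_dir: "checkpoints"
+save_every_n_epochs: 2  # the cloud configs above use 5
+```
+
+Whether an interrupted run can resume from a saved checkpoint depends on `train_clip_logo.py`, which this guide does not show.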
+
+### Minimize Storage Costs
+- RunPod charges $0.10/GB/month for container disk
+- Use network volumes only if needed
+- Delete pods when training completes
+
+### Monitor Training
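+- Watch for early convergence (training may finish before the 20-epoch `max_epochs` limit)
+- Early stopping (`patience: 5`, `min_delta: 0.001`) ends the run when validation stops improving, which saves time and cost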
+
+### Batch Training Runs
+- Test configuration locally first (1-2 epochs)
+- Run full training on cloud only when config is validated
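+
+One way to run the short local test is to copy an existing config under a hypothetical name such as `configs/smoke_test.yaml` and change only the epoch count. This assumes `configs/jetson_orin.yaml` uses the same `max_epochs` key as the cloud configs above:
+
+```yaml
+# configs/smoke_test.yaml (hypothetical file name): a copy of configs/jetson_orin.yaml
+# with only this key changed; every other key stays as in the original file
+max_epochs: 2  # 1-2 epochs is enough to confirm the pipeline runs end to end
+```
+
+It can then be launched through the same entry point: `python train_clip_logo.py --config configs/smoke_test.yaml`.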
+
+## Cost Comparison Summary
+
+| Option | Time | Cost | Best For |
+|--------|------|------|----------|
+| Jetson Orin (local) | ~24 hrs | Free* | No cloud dependency |
+| RTX 3090 (RunPod) | ~6 hrs | ~$2.50 | Lowest cost |
+| RTX 4090 (RunPod) | ~5 hrs | ~$3.00 | Best value |
+| L40S (RunPod) | ~3.5 hrs | ~$3.00 | Good balance |
+| A100 80GB (RunPod) | ~2.5 hrs | ~$5.00 | Large batches |
+| H100 80GB (RunPod) | ~1.75 hrs | ~$3.50 | Fastest |
+
+*Local training has electricity cost but no cloud fees.
+
+## References
+
+- [RunPod Pricing](https://www.runpod.io/pricing)
+- [RunPod RTX 4090](https://www.runpod.io/gpu-models/rtx-4090)
+- [RunPod Documentation](https://docs.runpod.io/)