Add CLIP fine-tuning pipeline for logo recognition
Implement contrastive learning with LoRA to fine-tune CLIP's vision encoder on
the LogoDet-3K dataset for improved logo embedding similarity.

New training module (training/):
- config.py: TrainingConfig dataclass with all hyperparameters
- dataset.py: LogoContrastiveDataset with logo-level splits
- model.py: LogoFineTunedCLIP wrapper with LoRA support
- losses.py: InfoNCE, TripletLoss, SupConLoss implementations
- trainer.py: Training loop with mixed precision and checkpointing
- evaluation.py: EmbeddingEvaluator for validation metrics

New scripts:
- train_clip_logo.py: Main training entry point
- export_model.py: Export to HuggingFace-compatible format

Configurations:
- configs/jetson_orin.yaml: Optimized for Jetson Orin AGX
- configs/cloud_rtx4090.yaml: Optimized for 24GB cloud GPUs
- configs/cloud_a100.yaml: Optimized for 80GB cloud GPUs

Documentation:
- CLIP_FINETUNING.md: Training guide and usage instructions
- CLOUD_TRAINING.md: Cloud GPU recommendations and cost estimates

Modified:
- logo_detection_detr.py: Add fine-tuned model loading support
- pyproject.toml: Add peft, pyyaml, torchvision dependencies
269
CLOUD_TRAINING.md
Normal file
@@ -0,0 +1,269 @@
# Cloud GPU Training for CLIP Fine-Tuning

This document explains how to use cloud GPU instances (e.g., RunPod) to fine-tune CLIP faster than is possible locally on a Jetson Orin AGX.

## Training Time Comparison

Local training on Jetson Orin AGX takes approximately 24 hours. Cloud GPUs cut this to an estimated 1.5-7 hours, depending on the GPU:

| GPU | VRAM | Est. Training Time | Hourly Rate | Est. Total Cost |
|-----|------|--------------------|-------------|-----------------|
| **RTX 4090** | 24GB | 4-6 hours | $0.59/hr | **$2.40-$3.50** |
| **RTX 3090** | 24GB | 5-7 hours | $0.39/hr | **$2.00-$2.75** |
| **A100 80GB** | 80GB | 2-3 hours | $1.99/hr | **$4.00-$6.00** |
| **L40S** | 48GB | 3-4 hours | $0.89/hr | **$2.70-$3.60** |
| **H100 80GB** | 80GB | 1.5-2 hours | $1.99/hr | **$3.00-$4.00** |

*Prices from RunPod Community Cloud as of January 2025. Rates may vary.*
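
The total-cost column is the hourly rate multiplied by the estimated training time, then rounded. A minimal sketch of that arithmetic, useful for re-deriving the estimates when rates change (the `GPUS` dictionary and `cost_range` helper are illustrative names, not part of the repository):

```python
# Hourly rate (USD) and estimated training-time range (hours),
# copied from the table above. Update these when prices change.
GPUS = {
    "RTX 4090": (0.59, 4.0, 6.0),
    "RTX 3090": (0.39, 5.0, 7.0),
    "A100 80GB": (1.99, 2.0, 3.0),
    "L40S": (0.89, 3.0, 4.0),
    "H100 80GB": (1.99, 1.5, 2.0),
}


def cost_range(rate, min_hours, max_hours):
    """Return (low, high) compute cost in USD for one training run."""
    return (round(rate * min_hours, 2), round(rate * max_hours, 2))


if __name__ == "__main__":
    for name, (rate, lo, hi) in GPUS.items():
        low, high = cost_range(rate, lo, hi)
        print(f"{name}: ${low:.2f}-${high:.2f}")
```

These figures cover GPU time only; storage and any idle time before or after training are billed separately.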

## Recommendations

### Best Value: RTX 4090 ($0.59/hr)
- 24GB VRAM is sufficient for ViT-L/14 with LoRA
- Good balance of speed and cost
- Widely available on Community Cloud
- **Total cost: ~$3 for complete training**

### Best Speed: H100 80GB ($1.99/hr)
- Fastest training (1.5-2 hours)
- 80GB VRAM allows larger batch sizes
- Can increase `batch_size` (64 in the 80GB config below) and reduce `gradient_accumulation_steps`
- **Total cost: ~$3-4**

### Budget Option: RTX 3090 ($0.39/hr)
- Cheapest hourly rate
- 24GB VRAM is sufficient, same as the RTX 4090
- Slightly slower than the 4090 (5-7 hours vs. 4-6 hours)
- **Total cost: ~$2-3**

## Cloud-Optimized Configurations

### RTX 4090 / RTX 3090 (24GB VRAM)

Create `configs/cloud_rtx4090.yaml`:

```yaml
# Optimized for 24GB VRAM cloud GPUs
base_model: "openai/clip-vit-large-patch14"

# Dataset paths
dataset_dir: "LogoDet-3K"
reference_dir: "reference_logos"
db_path: "test_data_mapping.db"

# Data splits
train_split: 0.7
val_split: 0.15
test_split: 0.15

# Larger batches for faster training
batch_size: 32
logos_per_batch: 32
samples_per_logo: 4
gradient_accumulation_steps: 4  # Effective batch = 128
num_workers: 8

# Model architecture
lora_r: 16
lora_alpha: 32
lora_dropout: 0.1
freeze_layers: 12
use_gradient_checkpointing: true

# Training
learning_rate: 1.0e-5
weight_decay: 0.01
warmup_steps: 500
max_epochs: 20
mixed_precision: true

# Loss
temperature: 0.07
loss_type: "infonce"

# Early stopping
patience: 5
min_delta: 0.001

# Output
checkpoint_dir: "checkpoints"
output_dir: "models/logo_detection/clip_finetuned"
save_every_n_epochs: 5

# Logging
log_every_n_steps: 10
eval_every_n_epochs: 1

seed: 42
use_augmentation: true
augmentation_strength: "medium"
```
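
Before paying for GPU time, it can help to sanity-check a config. The sketch below assumes, as the YAML comment states, that the effective batch size is `batch_size` times `gradient_accumulation_steps`, and that the three split fractions should sum to 1. The values are copied by hand from the 24GB config above and the 80GB config in the next section; `CONFIGS`, `effective_batch_size`, and `splits_sum_to_one` are hypothetical names, not part of `training/config.py`. How `batch_size` relates to `logos_per_batch` and `samples_per_logo` is not stated in this guide, so it is not checked here.

```python
# Values copied from the two cloud configs in this guide.
CONFIGS = {
    "cloud_rtx4090": {
        "batch_size": 32,
        "gradient_accumulation_steps": 4,
        "train_split": 0.7,
        "val_split": 0.15,
        "test_split": 0.15,
    },
    "cloud_a100": {
        "batch_size": 64,
        "gradient_accumulation_steps": 2,
        "train_split": 0.7,
        "val_split": 0.15,
        "test_split": 0.15,
    },
}


def effective_batch_size(cfg):
    """Samples contributing to each optimizer step (assumed relationship)."""
    return cfg["batch_size"] * cfg["gradient_accumulation_steps"]


def splits_sum_to_one(cfg, tol=1e-9):
    """True if the train/val/test fractions cover the whole dataset."""
    total = cfg["train_split"] + cfg["val_split"] + cfg["test_split"]
    return abs(total - 1.0) < tol


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(name, effective_batch_size(cfg), splits_sum_to_one(cfg))
```

Both configs are designed to reach the same effective batch size, so results should be comparable across GPU types.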

### A100 / H100 (80GB VRAM)

Create `configs/cloud_a100.yaml`:

```yaml
# Optimized for 80GB VRAM cloud GPUs (A100, H100)
base_model: "openai/clip-vit-large-patch14"

# Dataset paths
dataset_dir: "LogoDet-3K"
reference_dir: "reference_logos"
db_path: "test_data_mapping.db"

# Data splits
train_split: 0.7
val_split: 0.15
test_split: 0.15

# Maximum batch sizes for 80GB VRAM
batch_size: 64
logos_per_batch: 32
samples_per_logo: 4
gradient_accumulation_steps: 2  # Effective batch = 128
num_workers: 8

# Model architecture (can disable gradient checkpointing with 80GB)
lora_r: 16
lora_alpha: 32
lora_dropout: 0.1
freeze_layers: 12
use_gradient_checkpointing: false  # Not needed with 80GB

# Training
learning_rate: 1.0e-5
weight_decay: 0.01
warmup_steps: 500
max_epochs: 20
mixed_precision: true

# Loss
temperature: 0.07
loss_type: "infonce"

# Early stopping
patience: 5
min_delta: 0.001

# Output
checkpoint_dir: "checkpoints"
output_dir: "models/logo_detection/clip_finetuned"
save_every_n_epochs: 5

# Logging
log_every_n_steps: 10
eval_every_n_epochs: 1

seed: 42
use_augmentation: true
augmentation_strength: "medium"
```
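
Both configs select `loss_type: "infonce"` with `temperature: 0.07`. To show what the temperature does, here is a minimal pure-Python sketch of the standard InfoNCE formulation for a single anchor: cross-entropy over cosine similarities scaled by the temperature, with the positive as the target. It is illustrative only; the project's `training/losses.py` is not shown in this guide and may differ (for example, by operating on whole batches of tensors). The names `cosine` and `info_nce` are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss for one anchor: -log softmax of the positive logit."""
    # Logit 0 is the positive pair; the rest are negatives.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[0]


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    print(info_nce(anchor, [1.0, 0.0], [[0.0, 1.0]]))       # well separated
    print(info_nce(anchor, [1.0, 0.0], [[0.0, 1.0]], 1.0))  # softer temperature
```

A low temperature such as 0.07 sharpens the softmax, so a positive that is only slightly closer than the negatives already yields a small loss, while hard negatives are penalized heavily.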

## RunPod Quick Start

### 1. Create a Pod

1. Go to [RunPod](https://www.runpod.io/)
2. Select a GPU (RTX 4090 recommended)
3. Choose a PyTorch template (CUDA 12.x)
4. Set the volume size to 50GB (for dataset + models)

### 2. Set Up Environment

```bash
# Connect via SSH or web terminal

# Install dependencies
pip install peft pyyaml torchvision transformers tqdm pillow

# Clone your repository (or upload files)
git clone <your-repo-url>
cd logo_test

# Or use runpodctl to sync files: run send on the local machine,
# then run `runpodctl receive <code>` on the pod with the code it prints
# runpodctl send logo_test/
```

### 3. Prepare Data

If data isn't already prepared:

```bash
# This creates reference_logos/ and test_data_mapping.db
python prepare_test_data.py
```

### 4. Run Training

```bash
# For RTX 4090
python train_clip_logo.py --config configs/cloud_rtx4090.yaml

# For A100/H100
python train_clip_logo.py --config configs/cloud_a100.yaml

# Or with command-line overrides
python train_clip_logo.py --config configs/jetson_orin.yaml \
    --batch-size 32 \
    --gradient-accumulation-steps 4 \
    --num-workers 8
```

### 5. Download Results

```bash
# Export the trained model
python export_model.py \
    --checkpoint checkpoints/best.pt \
    --output models/logo_detection/clip_finetuned

# Download to local machine
# Option 1: Use runpodctl. Run send on the pod; it prints a one-time code.
# Then run `runpodctl receive <code>` on the local machine.
runpodctl send models/logo_detection/clip_finetuned

# Option 2: SCP (run on the local machine; add -P <port> if the pod
# exposes SSH on a non-default port)
scp -r root@<pod-ip>:/workspace/logo_test/models/logo_detection/clip_finetuned ./

# Option 3: Compress and download via web
tar -czvf clip_finetuned.tar.gz models/logo_detection/clip_finetuned
```

## Cost Optimization Tips

### Use Spot/Interruptible Instances
- Community Cloud GPUs are already cheaper than RunPod's Secure Cloud
- Some providers offer spot pricing for additional savings
- Save checkpoints frequently so an interrupted run can resume (e.g., `save_every_n_epochs: 2` instead of the 5 used in the configs above)

### Minimize Storage Costs
- RunPod charges $0.10/GB/month for container disk
- Use network volumes only if needed
- Delete pods when training completes
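
To put the storage rate in perspective, the sketch below estimates the disk cost of a pod from its size and lifetime. It assumes the $0.10/GB/month rate quoted above is prorated by the hour and uses 730 hours as an average month; both are assumptions for illustration, not confirmed billing rules, and `storage_cost` is a hypothetical helper.

```python
HOURS_PER_MONTH = 730  # assumption: average month length used for proration


def storage_cost(size_gb, hours, rate_per_gb_month=0.10):
    """Estimated disk cost in USD for keeping size_gb allocated for `hours`."""
    return size_gb * rate_per_gb_month * hours / HOURS_PER_MONTH


if __name__ == "__main__":
    # 50GB volume kept for a 5-hour training run vs. left running for a month.
    print(f"5-hour run:  ${storage_cost(50, 5):.2f}")
    print(f"Full month:  ${storage_cost(50, HOURS_PER_MONTH):.2f}")
```

Under these assumptions, storage is negligible for a single run but exceeds the GPU cost of a training run if a 50GB pod is left allocated for a month.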

### Monitor Training
- Watch for early convergence (training may finish before 20 epochs)
- Early stopping (`patience: 5`, `min_delta: 0.001`) ends the run when validation stops improving, which saves time and cost
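
The `patience` and `min_delta` settings can be read as follows: training stops once the validation metric has failed to improve by at least `min_delta` for `patience` consecutive evaluations. The sketch below is a minimal illustration of that rule, assuming a metric where higher is better; the `EarlyStopping` class is hypothetical and is not the implementation in `training/trainer.py`.

```python
class EarlyStopping:
    """Stop when the metric has not improved by min_delta for `patience` evals."""

    def __init__(self, patience=5, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best = None       # best metric seen so far
        self.bad_epochs = 0    # consecutive evaluations without improvement

    def step(self, metric):
        """Record one evaluation; return True if training should stop."""
        if self.best is None or metric > self.best + self.min_delta:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopping(patience=5, min_delta=0.001)
    history = [0.50, 0.60, 0.6005, 0.60, 0.60, 0.60, 0.60]
    for epoch, value in enumerate(history, start=1):
        if stopper.step(value):
            print(f"Stopping after epoch {epoch}")
            break
```

Note that the gain from 0.60 to 0.6005 is below `min_delta`, so it counts toward the patience budget rather than resetting it.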

### Batch Training Runs
- Test configuration locally first (1-2 epochs)
- Run full training on cloud only when config is validated

## Cost Comparison Summary

| Option | Time | Cost | Best For |
|--------|------|------|----------|
| Jetson Orin (local) | ~24 hrs | Free* | No cloud dependency |
| RTX 3090 (RunPod) | ~6 hrs | ~$2.50 | Lowest cost |
| RTX 4090 (RunPod) | ~5 hrs | ~$3.00 | Best value |
| L40S (RunPod) | ~3.5 hrs | ~$3.00 | Good balance |
| A100 80GB (RunPod) | ~2.5 hrs | ~$5.00 | Large batches |
| H100 80GB (RunPod) | ~1.5-2 hrs | ~$3.50 | Fastest |

*Local training has electricity cost but no cloud fees.

## References

- [RunPod Pricing](https://www.runpod.io/pricing)
- [RunPod RTX 4090](https://www.runpod.io/gpu-models/rtx-4090)
- [RunPod Documentation](https://docs.runpod.io/)