logo_test/CLOUD_TRAINING.md
Rick McEwen 44e8b6ae7d Add CLIP fine-tuning pipeline for logo recognition
Implement contrastive learning with LoRA to fine-tune CLIP's vision
encoder on LogoDet-3K dataset for improved logo embedding similarity.

New training module (training/):
- config.py: TrainingConfig dataclass with all hyperparameters
- dataset.py: LogoContrastiveDataset with logo-level splits
- model.py: LogoFineTunedCLIP wrapper with LoRA support
- losses.py: InfoNCE, TripletLoss, SupConLoss implementations
- trainer.py: Training loop with mixed precision and checkpointing
- evaluation.py: EmbeddingEvaluator for validation metrics

New scripts:
- train_clip_logo.py: Main training entry point
- export_model.py: Export to HuggingFace-compatible format

Configurations:
- configs/jetson_orin.yaml: Optimized for Jetson Orin AGX
- configs/cloud_rtx4090.yaml: Optimized for 24GB cloud GPUs
- configs/cloud_a100.yaml: Optimized for 80GB cloud GPUs

Documentation:
- CLIP_FINETUNING.md: Training guide and usage instructions
- CLOUD_TRAINING.md: Cloud GPU recommendations and cost estimates

Modified:
- logo_detection_detr.py: Add fine-tuned model loading support
- pyproject.toml: Add peft, pyyaml, torchvision dependencies
2026-01-04 13:45:25 -05:00

Cloud GPU Training for CLIP Fine-Tuning

This document explains how to use cloud GPU instances (e.g., RunPod) to fine-tune CLIP faster than local training on Jetson Orin AGX.

Training Time Comparison

Local training on Jetson Orin AGX takes approximately 24 hours. The cloud GPUs below are estimated to finish in 1.5-7 hours, roughly 3x to 16x faster:

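| GPU       | VRAM | Est. Training Time | Hourly Rate | Est. Total Cost |
|-----------|------|--------------------|-------------|-----------------|
| RTX 4090  | 24GB | 4-6 hours          | $0.59/hr    | $2.40-$3.50     |
| RTX 3090  | 24GB | 5-7 hours          | $0.39/hr    | $2.00-$2.75     |
| A100 80GB | 80GB | 2-3 hours          | $1.99/hr    | $4.00-$6.00     |
| L40S      | 48GB | 3-4 hours          | $0.89/hr    | $2.70-$3.60     |
| H100 80GB | 80GB | 1.5-2 hours        | $1.99/hr    | $3.00-$4.00     |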

Prices from RunPod Community Cloud as of January 2025. Rates may vary.
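
The cost column is simply training time multiplied by the hourly rate, rounded to convenient figures. A minimal sketch of that arithmetic, using the time ranges and rates from the table (the `GPUS` mapping and `cost_range` helper are illustrative names, not part of the repository):

```python
# Time ranges (hours) and hourly rates (USD) copied from the table above.
GPUS = {
    "RTX 4090": (4.0, 6.0, 0.59),
    "RTX 3090": (5.0, 7.0, 0.39),
    "A100 80GB": (2.0, 3.0, 1.99),
    "L40S": (3.0, 4.0, 0.89),
    "H100 80GB": (1.5, 2.0, 1.99),
}


def cost_range(min_hours, max_hours, rate_per_hour):
    """Return the (low, high) training cost in dollars, rounded to cents."""
    return round(min_hours * rate_per_hour, 2), round(max_hours * rate_per_hour, 2)


for name, (min_h, max_h, rate) in GPUS.items():
    low, high = cost_range(min_h, max_h, rate)
    print(f"{name}: ${low:.2f}-${high:.2f}")
```

Updating the rates in `GPUS` is enough to re-derive the estimates when prices change. Storage and idle pod time are not included.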

Recommendations

Best Value: RTX 4090 ($0.59/hr)

  • 24GB VRAM is sufficient for ViT-L/14 with LoRA
  • Good balance of speed and cost
  • Widely available on Community Cloud
  • Total cost: ~$3 for complete training

Best Speed: H100 80GB ($1.99/hr)

  • Fastest training (1.5-2 hours)
  • 80GB VRAM allows larger batch sizes
  • Can increase batch_size to 64 and reduce gradient_accumulation_steps to 2 (see configs/cloud_a100.yaml)
  • Total cost: ~$3-4

Budget Option: RTX 3090 ($0.39/hr)

  • Cheapest hourly rate
  • 24GB VRAM, the same as the RTX 4090, so configs/cloud_rtx4090.yaml applies unchanged
  • About an hour slower than the RTX 4090 (5-7 vs. 4-6 hours)
  • Total cost: ~$2-3

Cloud-Optimized Configurations

RTX 4090 / RTX 3090 (24GB VRAM)

Create configs/cloud_rtx4090.yaml:

# Optimized for 24GB VRAM cloud GPUs
base_model: "openai/clip-vit-large-patch14"

# Dataset paths
dataset_dir: "LogoDet-3K"
reference_dir: "reference_logos"
db_path: "test_data_mapping.db"

# Data splits
train_split: 0.7
val_split: 0.15
test_split: 0.15

# Larger batches for faster training
batch_size: 32
logos_per_batch: 32
samples_per_logo: 4
gradient_accumulation_steps: 4  # Effective batch = 128
num_workers: 8

# Model architecture
lora_r: 16
lora_alpha: 32
lora_dropout: 0.1
freeze_layers: 12
use_gradient_checkpointing: true

# Training
learning_rate: 1.0e-5
weight_decay: 0.01
warmup_steps: 500
max_epochs: 20
mixed_precision: true

# Loss
temperature: 0.07
loss_type: "infonce"

# Early stopping
patience: 5
min_delta: 0.001

# Output
checkpoint_dir: "checkpoints"
output_dir: "models/logo_detection/clip_finetuned"
save_every_n_epochs: 5

# Logging
log_every_n_steps: 10
eval_every_n_epochs: 1

seed: 42
use_augmentation: true
augmentation_strength: "medium"

A100 / H100 (80GB VRAM)

Create configs/cloud_a100.yaml:

# Optimized for 80GB VRAM cloud GPUs (A100, H100)
base_model: "openai/clip-vit-large-patch14"

# Dataset paths
dataset_dir: "LogoDet-3K"
reference_dir: "reference_logos"
db_path: "test_data_mapping.db"

# Data splits
train_split: 0.7
val_split: 0.15
test_split: 0.15

# Maximum batch sizes for 80GB VRAM
batch_size: 64
logos_per_batch: 32
samples_per_logo: 4
gradient_accumulation_steps: 2  # Effective batch = 128
num_workers: 8

# Model architecture (can disable gradient checkpointing with 80GB)
lora_r: 16
lora_alpha: 32
lora_dropout: 0.1
freeze_layers: 12
use_gradient_checkpointing: false  # Not needed with 80GB

# Training
learning_rate: 1.0e-5
weight_decay: 0.01
warmup_steps: 500
max_epochs: 20
mixed_precision: true

# Loss
temperature: 0.07
loss_type: "infonce"

# Early stopping
patience: 5
min_delta: 0.001

# Output
checkpoint_dir: "checkpoints"
output_dir: "models/logo_detection/clip_finetuned"
save_every_n_epochs: 5

# Logging
log_every_n_steps: 10
eval_every_n_epochs: 1

seed: 42
use_augmentation: true
augmentation_strength: "medium"
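
Both configurations keep the same effective batch size, so the learning rate and warmup settings carry over unchanged between GPU classes. A minimal sketch of the relationship, assuming the trainer accumulates gradients over `gradient_accumulation_steps` micro-batches before each optimizer step (the `effective_batch_size` helper is hypothetical, and the dictionaries hold only the two relevant keys from each YAML file):

```python
def effective_batch_size(config):
    """Samples contributing to one optimizer step under gradient accumulation."""
    return config["batch_size"] * config["gradient_accumulation_steps"]


# Values copied from configs/cloud_rtx4090.yaml and configs/cloud_a100.yaml.
rtx4090 = {"batch_size": 32, "gradient_accumulation_steps": 4}
a100 = {"batch_size": 64, "gradient_accumulation_steps": 2}

for name, cfg in (("24GB config", rtx4090), ("80GB config", a100)):
    print(f"{name}: effective batch = {effective_batch_size(cfg)}")
```

The 80GB configuration reaches the same effective batch in half as many forward/backward passes per optimizer step, which is one source of its shorter training time.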

RunPod Quick Start

1. Create a Pod

  1. Go to RunPod
  2. Select GPU (RTX 4090 recommended)
  3. Choose PyTorch template (CUDA 12.x)
  4. Set volume size: 50GB (for dataset + models)

2. Setup Environment

# Connect via SSH or web terminal

# Install dependencies
pip install peft pyyaml torchvision transformers tqdm pillow

# Clone your repository (or upload files)
git clone <your-repo-url>
cd logo_test

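# Or transfer an archive with runpodctl: run `send` on the local machine,
# then `runpodctl receive <one-time-code>` on the pod
# runpodctl send logo_test.tar.gz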

3. Prepare Data

If data isn't already prepared:

# This creates reference_logos/ and test_data_mapping.db
python prepare_test_data.py

4. Run Training

# For RTX 4090
python train_clip_logo.py --config configs/cloud_rtx4090.yaml

# For A100/H100
python train_clip_logo.py --config configs/cloud_a100.yaml

# Or with command-line overrides
python train_clip_logo.py --config configs/jetson_orin.yaml \
    --batch-size 32 \
    --gradient-accumulation-steps 4 \
    --num-workers 8
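
Choosing a config by GPU memory can be scripted. The sketch below is an illustration only: `pick_config` and `training_command` are hypothetical helpers, the thresholds are assumptions, and because this document defines no 48GB config, a 48GB card such as the L40S falls back to the 24GB settings.

```python
def pick_config(vram_gb):
    """Map available GPU memory (GB) to one of the config files in this guide."""
    if vram_gb >= 80:
        return "configs/cloud_a100.yaml"
    if vram_gb >= 24:
        # Assumption: 48GB cards reuse the 24GB config, since no 48GB config exists.
        return "configs/cloud_rtx4090.yaml"
    # Assumption: anything smaller falls back to the most conservative config.
    return "configs/jetson_orin.yaml"


def training_command(vram_gb):
    """Build the argument list for the training entry point."""
    return ["python", "train_clip_logo.py", "--config", pick_config(vram_gb)]


print(" ".join(training_command(24)))
print(" ".join(training_command(80)))
```

The resulting list can be passed to `subprocess.run` or joined into a shell command.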

5. Download Results

# Export the trained model
python export_model.py \
    --checkpoint checkpoints/best.pt \
    --output models/logo_detection/clip_finetuned

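# Download to local machine
# Option 1: Use runpodctl (run `send` on the pod; it prints a one-time code)
tar -czf clip_finetuned.tar.gz models/logo_detection/clip_finetuned
runpodctl send clip_finetuned.tar.gz
# Then, on the local machine:
# runpodctl receive <one-time-code>

# Option 2: SCP (run on the local machine)
scp -r root@<pod-ip>:/workspace/logo_test/models/logo_detection/clip_finetuned ./

# Option 3: Compress and download via web
tar -czvf clip_finetuned.tar.gz models/logo_detection/clip_finetuned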

Cost Optimization Tips

Use Spot/Interruptible Instances

  • Community Cloud GPUs are already cheaper than RunPod's Secure Cloud
  • Some providers offer spot pricing for additional savings
  • Save checkpoints frequently (e.g., save_every_n_epochs: 2) so an interrupted pod loses little progress

Minimize Storage Costs

  • RunPod charges $0.10/GB/month for container disk
  • Use network volumes only if needed
  • Delete pods when training completes
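
To see why deleting idle pods matters, compare storage cost with the training cost itself. A rough sketch using the $0.10/GB/month rate quoted above and the 50GB volume from the Quick Start; the `storage_cost` helper and the day-based proration are assumptions, and actual billing granularity may differ:

```python
def storage_cost(size_gb, days, rate_per_gb_month=0.10, days_per_month=30):
    """Estimate disk cost in dollars, prorated by day (an assumption)."""
    return size_gb * rate_per_gb_month * days / days_per_month


# A 50GB disk kept for one day vs. left around for a full month.
print(f"1 day:   ${storage_cost(50, 1):.2f}")
print(f"30 days: ${storage_cost(50, 30):.2f}")
```

Under these assumptions, a forgotten 50GB disk costs more per month than a complete RTX 4090 training run.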

Monitor Training

  • Watch for early convergence (may finish before 20 epochs)
  • Early stopping will save time/cost if no improvement
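
The `patience` and `min_delta` settings in the configs control when early stopping triggers. The sketch below shows one common formulation, assuming the monitored metric is a validation loss where lower is better; the `EarlyStopping` class is hypothetical, and the logic in trainer.py may differ.

```python
class EarlyStopping:
    """Stop when the metric has not improved by min_delta for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative losses: improvement stalls after the third epoch.
stopper = EarlyStopping(patience=5, min_delta=0.001)
losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7]
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        print(f"Stopping after epoch {epoch}")
        break
```

With `patience: 5`, a run that plateaus early still pays for five extra epochs before stopping, so the saving relative to the full 20 epochs depends on how soon the plateau begins.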

Batch Training Runs

  • Test configuration locally first (1-2 epochs)
  • Run full training on cloud only when config is validated

Cost Comparison Summary

| Option              | Time      | Cost   | Best For            |
|---------------------|-----------|--------|---------------------|
| Jetson Orin (local) | ~24 hrs   | Free*  | No cloud dependency |
| RTX 3090 (RunPod)   | ~6 hrs    | ~$2.50 | Lowest cost         |
| RTX 4090 (RunPod)   | ~5 hrs    | ~$3.00 | Best value          |
| L40S (RunPod)       | ~3.5 hrs  | ~$3.00 | Good balance        |
| A100 80GB (RunPod)  | ~2.5 hrs  | ~$5.00 | Large batches       |
| H100 80GB (RunPod)  | ~1.75 hrs | ~$3.50 | Fastest             |

*Local training has electricity cost but no cloud fees.
