Add CLIP fine-tuning pipeline for logo recognition

Implement contrastive learning with LoRA to fine-tune CLIP's vision encoder on LogoDet-3K dataset for improved logo embedding similarity.

New training module (training/):
- config.py: TrainingConfig dataclass with all hyperparameters
- dataset.py: LogoContrastiveDataset with logo-level splits
- model.py: LogoFineTunedCLIP wrapper with LoRA support
- losses.py: InfoNCE, TripletLoss, SupConLoss implementations
- trainer.py: Training loop with mixed precision and checkpointing
- evaluation.py: EmbeddingEvaluator for validation metrics

New scripts:
- train_clip_logo.py: Main training entry point
- export_model.py: Export to HuggingFace-compatible format

Configurations:
- configs/jetson_orin.yaml: Optimized for Jetson Orin AGX
- configs/cloud_rtx4090.yaml: Optimized for 24GB cloud GPUs
- configs/cloud_a100.yaml: Optimized for 80GB cloud GPUs

Documentation:
- CLIP_FINETUNING.md: Training guide and usage instructions
- CLOUD_TRAINING.md: Cloud GPU recommendations and cost estimates

Modified:
- logo_detection_detr.py: Add fine-tuned model loading support
- pyproject.toml: Add peft, pyyaml, torchvision dependencies
2026-01-04 13:45:25 -05:00
parent 1551360028
commit 44e8b6ae7d
16 changed files with 3334 additions and 12 deletions
--- a/configs/cloud_a100.yaml
+++ b/configs/cloud_a100.yaml
@@ -0,0 +1,64 @@
+# Training configuration optimized for cloud A100 / H100 (80GB VRAM)
+#
+# Usage:
+#   python train_clip_logo.py --config configs/cloud_a100.yaml
+#
+# Estimated training time: 1.5-3 hours
+# Estimated cost on RunPod: ~$3-6
+
+# Base model
+base_model: "openai/clip-vit-large-patch14"
+
+# Dataset paths
+dataset_dir: "LogoDet-3K"
+reference_dir: "reference_logos"
+db_path: "test_data_mapping.db"
+
+# Data splits
+train_split: 0.7
+val_split: 0.15
+test_split: 0.15
+
+# Maximum batch sizes for 80GB VRAM
+batch_size: 64
+logos_per_batch: 32
+samples_per_logo: 4
+gradient_accumulation_steps: 2  # Effective batch = 128
+num_workers: 8
+
+# Model architecture (no gradient checkpointing needed with 80GB)
+lora_r: 16
+lora_alpha: 32
+lora_dropout: 0.1
+freeze_layers: 12
+use_gradient_checkpointing: false
+
+# Training
+learning_rate: 1.0e-5
+weight_decay: 0.01
+warmup_steps: 500
+max_epochs: 20
+mixed_precision: true
+
+# Loss
+temperature: 0.07
+loss_type: "infonce"
+triplet_margin: 0.3
+
+# Early stopping
+patience: 5
+min_delta: 0.001
+
+# Output
+checkpoint_dir: "checkpoints"
+output_dir: "models/logo_detection/clip_finetuned"
+save_every_n_epochs: 2  # Save more frequently for cloud
+
+# Logging
+log_every_n_steps: 10
+eval_every_n_epochs: 1
+
+seed: 42
+use_hard_negatives: false
+use_augmentation: true
+augmentation_strength: "medium"
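Note on the batch settings: the inline comments in all three configs put the effective batch at 128, trading `batch_size` against `gradient_accumulation_steps` as memory allows. The sketch below is a minimal sanity check of that arithmetic and of the split ratios. `BatchSettings` and `check_config` are hypothetical names for illustration and are not taken from `training/config.py`; the formula effective batch = `batch_size` * `gradient_accumulation_steps` is an assumption inferred from those comments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchSettings:
    """Hypothetical subset of the YAML keys; not the repo's TrainingConfig."""

    batch_size: int
    gradient_accumulation_steps: int
    train_split: float = 0.7
    val_split: float = 0.15
    test_split: float = 0.15

    @property
    def effective_batch(self) -> int:
        # Assumption: samples per optimizer step =
        # micro-batch size * accumulation steps.
        return self.batch_size * self.gradient_accumulation_steps


def check_config(cfg: BatchSettings, expected_effective_batch: int = 128) -> None:
    """Raise ValueError if the split ratios or effective batch look inconsistent."""
    total = cfg.train_split + cfg.val_split + cfg.test_split
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"splits sum to {total}, expected 1.0")
    if cfg.effective_batch != expected_effective_batch:
        raise ValueError(
            f"effective batch is {cfg.effective_batch}, "
            f"expected {expected_effective_batch}"
        )


# Values copied from the three YAML files in this commit.
PROFILES = {
    "cloud_a100": BatchSettings(batch_size=64, gradient_accumulation_steps=2),
    "cloud_rtx4090": BatchSettings(batch_size=32, gradient_accumulation_steps=4),
    "jetson_orin": BatchSettings(batch_size=16, gradient_accumulation_steps=8),
}

for name, cfg in PROFILES.items():
    check_config(cfg)  # passes silently for all three profiles
```

A check along these lines could run right after the YAML is parsed, so that an edited config whose accumulation steps no longer match its batch size fails before any GPU time is spent.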
--- a/configs/cloud_rtx4090.yaml
+++ b/configs/cloud_rtx4090.yaml
@@ -0,0 +1,64 @@
+# Training configuration optimized for cloud RTX 4090 / RTX 3090 (24GB VRAM)
+#
+# Usage:
+#   python train_clip_logo.py --config configs/cloud_rtx4090.yaml
+#
+# Estimated training time: 4-6 hours
+# Estimated cost on RunPod: ~$3
+
+# Base model
+base_model: "openai/clip-vit-large-patch14"
+
+# Dataset paths
+dataset_dir: "LogoDet-3K"
+reference_dir: "reference_logos"
+db_path: "test_data_mapping.db"
+
+# Data splits
+train_split: 0.7
+val_split: 0.15
+test_split: 0.15
+
+# Larger batches for faster training on 24GB VRAM
+batch_size: 32
+logos_per_batch: 32
+samples_per_logo: 4
+gradient_accumulation_steps: 4  # Effective batch = 128
+num_workers: 8
+
+# Model architecture
+lora_r: 16
+lora_alpha: 32
+lora_dropout: 0.1
+freeze_layers: 12
+use_gradient_checkpointing: true
+
+# Training
+learning_rate: 1.0e-5
+weight_decay: 0.01
+warmup_steps: 500
+max_epochs: 20
+mixed_precision: true
+
+# Loss
+temperature: 0.07
+loss_type: "infonce"
+triplet_margin: 0.3
+
+# Early stopping
+patience: 5
+min_delta: 0.001
+
+# Output
+checkpoint_dir: "checkpoints"
+output_dir: "models/logo_detection/clip_finetuned"
+save_every_n_epochs: 2  # Save more frequently for cloud
+
+# Logging
+log_every_n_steps: 10
+eval_every_n_epochs: 1
+
+seed: 42
+use_hard_negatives: false
+use_augmentation: true
+augmentation_strength: "medium"
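Note on the data splits: the commit message describes `LogoContrastiveDataset` as using logo-level splits, which suggests the 0.7 / 0.15 / 0.15 ratios partition logo classes rather than individual images, so that validation and test logos are never seen during training. A minimal sketch of that idea follows. `split_logo_classes` is a hypothetical helper, and the actual logic in `training/dataset.py` may differ (for example in how rounding or the remainder is handled).

```python
import random
from typing import Dict, List, Sequence


def split_logo_classes(
    logo_names: Sequence[str],
    train_split: float = 0.7,
    val_split: float = 0.15,
    seed: int = 42,
) -> Dict[str, List[str]]:
    """Partition logo classes (not images) into train/val/test.

    Hypothetical helper: the test set receives whatever remains after
    the train and validation classes are taken.
    """
    names = sorted(set(logo_names))  # stable order before shuffling
    rng = random.Random(seed)        # local RNG, global state untouched
    rng.shuffle(names)

    n_train = round(len(names) * train_split)
    n_val = round(len(names) * val_split)
    return {
        "train": names[:n_train],
        "val": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],
    }


# Illustrative class names only; LogoDet-3K uses real brand names.
logos = [f"brand_{i:03d}" for i in range(100)]
splits = split_logo_classes(logos)
```

Because every image of a given logo lands in exactly one split, validation metrics measure how well the embeddings generalize to unseen logos instead of unseen photos of known logos.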
--- a/configs/jetson_orin.yaml
+++ b/configs/jetson_orin.yaml
@@ -0,0 +1,76 @@
+# Training configuration optimized for Jetson Orin AGX (~64GB shared memory)
+#
+# Usage:
+#   uv run python train_clip_logo.py --config configs/jetson_orin.yaml
+
+# Base model
+base_model: "openai/clip-vit-large-patch14"
+
+# Dataset paths (relative to project root)
+dataset_dir: "LogoDet-3K"
+reference_dir: "reference_logos"
+db_path: "test_data_mapping.db"
+
+# Data split ratios (logo-level split for generalization testing)
+train_split: 0.7
+val_split: 0.15
+test_split: 0.15
+
+# Batch construction
+# - batch_size: Samples per forward pass (keep low for memory)
+# - logos_per_batch: Different logo classes per contrastive batch
+# - samples_per_logo: Samples of each logo (creates positive pairs)
+# - Effective samples per step = logos_per_batch * samples_per_logo = 128
+batch_size: 16
+logos_per_batch: 32
+samples_per_logo: 4
+gradient_accumulation_steps: 8  # Effective batch = 128
+num_workers: 4
+
+# Model architecture
+# LoRA enables memory-efficient fine-tuning by training low-rank adapters
+# instead of full model weights
+lora_r: 16                      # LoRA rank (0 to disable)
+lora_alpha: 32                  # LoRA scaling factor
+lora_dropout: 0.1               # Dropout in LoRA layers
+freeze_layers: 12               # Freeze first 12 of 24 transformer layers
+use_gradient_checkpointing: true  # Trade compute for memory
+
+# Training hyperparameters
+learning_rate: 1.0e-5           # Conservative LR for fine-tuning
+weight_decay: 0.01              # L2 regularization
+warmup_steps: 500               # LR warmup steps
+max_epochs: 20                  # Maximum training epochs
+mixed_precision: true           # FP16 training for memory efficiency
+
+# Loss function
+# InfoNCE is the contrastive loss used in CLIP training
+temperature: 0.07               # Similarity scaling (0.05-0.1 typical)
+loss_type: "infonce"            # Options: infonce, supcon, triplet, combined
+triplet_margin: 0.3             # Only used if loss_type is triplet
+
+# Early stopping
+patience: 5                     # Stop if no improvement for N epochs
+min_delta: 0.001                # Minimum improvement threshold
+
+# Checkpoints and output
+checkpoint_dir: "checkpoints"
+output_dir: "models/logo_detection/clip_finetuned"
+save_every_n_epochs: 5
+
+# Logging
+log_every_n_steps: 10
+eval_every_n_epochs: 1
+
+# Reproducibility
+seed: 42
+
+# Hard negative mining (advanced)
+# Enable after initial training epochs for harder examples
+use_hard_negatives: false
+hard_negative_start_epoch: 5
+hard_negatives_per_logo: 10
+
+# Data augmentation
+use_augmentation: true
+augmentation_strength: "medium"  # light, medium, or strong
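Note on the loss settings: with `loss_type: "infonce"`, the `temperature` value divides the cosine similarities before the softmax, so smaller values sharpen the distinction between matching and non-matching logos. The sketch below is a plain-Python, one-directional InfoNCE over paired embeddings, written only to show the role of `temperature`. It is not the implementation in `training/losses.py`: `info_nce` and `l2_normalize` are hypothetical names, the pairing of `anchors[i]` with `positives[i]` is an assumption, and CLIP's original objective averages this loss over both directions.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(
    anchors: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    temperature: float = 0.07,
) -> float:
    """Mean InfoNCE loss where positives[i] is the match for anchors[i].

    Every other entry of `positives` acts as a negative for anchor i.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarities scaled by 1 / temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-softmax at the matching index.
        total += log_denominator - logits[i]
    return total / len(a)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, [[0.9, 0.1], [0.1, 0.9]])
mismatched = info_nce(anchors, [[0.1, 0.9], [0.9, 0.1]])
soft = info_nce(anchors, [[0.9, 0.1], [0.1, 0.9]], temperature=1.0)
```

In this toy batch, the aligned pairs give a lower loss than the mismatched ones, and raising the temperature to 1.0 increases the loss for the same aligned pairs because the softmax becomes less peaked.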